ED 206 721

DOCUMENT RESUME

TM 810 629

AUTHOR        Ward, James G.; Gould, Jewell C.
TITLE         Plain Talk About Standardized Tests. Research Report.
INSTITUTION   American Federation of Teachers, Washington, D.C.
SPONS AGENCY  National Inst. of Education (ED), Washington, D.C.
PUB DATE      Oct 80
GRANT         NIE-G-79-0041
NOTE          93p.

EDRS PRICE    MF01/PC04 Plus Postage.
DESCRIPTORS   Achievement Tests; Aptitude Tests; Criterion
              Referenced Tests; Elementary Secondary Education;
              Minimum Competency Testing; Norm Referenced Tests;
              Scoring Formulas; *Standardized Tests; *Test
              Interpretation; Test Reliability; *Test Selection;
              *Test Use; Test Validity

ABSTRACT

This handbook, in two parts, constitutes a manual
prepared by the American Federation of Teachers for improving
teachers' use of standardized tests. Part 1 outlines basic concepts
and issues surrounding standardized testing for teachers, parents and
school administrators. The terms norm-referenced tests, criterion
referenced tests, minimum competency tests, achievement and aptitude
tests are defined and explained, then followed by a section regarding
test selection, in which the aspects of test validity and reliability
are introduced. The next chapter, concerned with test interpretation,
discusses raw scores and various types of derived scores commonly
used to report test results, how they are derived, and cautions to be
considered in their use. Applications of standardized tests to
instructional planning, placement decisions, diagnosis of student
needs, and the evaluation of instructional progress are also
discussed. Finally, basic premises contributing to the proper use of
tests are reviewed. Appendixes include lists of available tests, test
publishers, and reference materials which review tests. Part 2
presents a hypothetical school district and two exercises in test
selection, score analysis and presentation of case to interested
parties. (IBF)



Reproductions supplied by EDRS are the best that can be made
from the original document.






A REPORT OF THE 



PLAIN TALK
ABOUT STANDARDIZED TESTS 



Prepared by the American Federation of Teachers 
Department of Research 

James G. Ward, Director 
Jewell C. Gould, Assistant Director 



NIE GRANT/NIE-G-79-0041



This study was prepared by the Research 
Department of the American Federation of 
Teachers under Grant Number NIE-G-79-0041
from the National Institute of Education,
U.S. Department of Education. The opinions
expressed in this study do not necessarily
reflect the position or policy of the 
National Institute of Education or the 
U.S. Department of Education. 






TABLE OF CONTENTS

Preface

I. Testing Today

II. Standardized Tests

III. Selecting Standardized Tests to Suit
     Your Purposes; Standards for
     Evaluation

IV. Interpreting Test Results

V. Application of Standardized Tests
   to Your Purposes

Bibliography

Appendix



This manual was prepared by the Research Department of the
American Federation of Teachers in collaboration with the
Center for the Study of Evaluation (CSE). We gratefully
acknowledge the expert assistance of Anne Coldblatt and
Helen Nomorin in the preparation and typing of this manual.






PREFACE 



The proper use of tests is a topic of interest to all teachers. 
Tests are an important tool for teaching and learning and their
appropriate use is critical to the educational process. 

A number of times throughout the 1970's the Executive Council
of the American Federation of Teachers adopted resolutions calling 
for more study of testing, more responsible use of tests, the 
improvement of testing processes, and the wider dissemination of 
information on tests. 

In 1978, the AFT applied for and received a two year grant
from the National Institute of Education to prepare training
materials and conduct conferences on improving teachers' use of
standardized tests. Part of the plan was to survey a representative
sample of teachers to ascertain their preparation and knowledge in
testing, their assessment of the importance of testing to their
teaching, and their attitudes toward various issues in testing.
The results of that survey were published in a report entitled
"Teachers and Testing: A Survey of Knowledge and Attitudes," by
James G. Ward (Washington, D.C.: American Federation of Teachers,
July 1980).

This handbook, Plain Talk About Standardized Tests, is the 
basic manual for the training conferences. It is intended to be
a primer on the issues in standardized tests for teachers and
others who need a basic knowledge of the topic. The material on
which a number of the chapters are based was a technical draft
prepared by the Center for the Study of Evaluation, University of
California at Los Angeles. Substantial revisions were made in
those chapters based on comments and critiques from previous 
conference participants and outside reviewers, and from decisions 
made by the authors. 

This handbook represents one part of the commitment of the 
American Federation of Teachers to the professional growth of its
members. Through AFT training conferences, QuEST conferences, 
and other educational services of the AFT, these materials will 
contribute to greater understanding of all tests and the better 
education of children. 



Chapter I 



TESTING TODAY



Testing occupies a central position in American education. 
In one form or another, tests are accepted by almost all involved
in education as part of that process. It is part of the rational 
method in Western thought that one determines what one wants to 
achieve, one tries to achieve it, and then one has some process by 
which to measure whether it has been achieved and to what extent. 
Therefore, some kind of testing or assessment is necessary to 
complete the process. American education has predicated much of
its practice on this model. 

It is difficult to escape tests in American schools. Tests 
other than teacher prepared tests are increasing in number and 
visibility. 

o Between 40 and 45 states conduct state assessment programs. 

o Almost 40 states have adopted minimum competency testing
programs.

o Over 90 percent of local school districts regularly
administer standardized norm-referenced tests to their
students.

o Many federal education programs require regular, formal
testing of children for program entry, program exit, or
for program evaluation.

o The public is demanding more accountability from public
schools and sees tests as a vehicle to determine program
success or failure.

Testing is an integral part of schools and is likely to remain so 
in the foreseeable future. 

Yet testing is controversial. While every day, educators are
using test results to improve instruction and educational
decision-making, critics of testing decry the essential
meaninglessness of test scores and accuse tests of destroying
children. For example, one such attack on testing shows a photograph
of a sad little six year old girl, with a tear on her cheek, who has
just taken a test. The accompanying text explains that this child
has just scored below average on a standardized test and, thus, at
the tender age of six has been destroyed educationally for life. The
message of this one page advertisement is clear. Because this child
has scored below average on one test, there is nothing teachers or
schools could possibly do to help this child. Such anti-education
diatribes miss completely the role of tests in teaching and learning.
It is sometimes forgotten that tests are tools and, as such, can
be used or abused.

Tests are one of the tools of the professional teacher. From
test scores inferences about students are made. To the extent 
that teachers understand the limits of the test instrument and 
apply the test results within those limits, tests are helpful. It 
is when those limits are not known, or heeded, that tests become 
abused.

Understanding tests and their limits is not an easy task.
The understanding is within the ability of professionals to
grasp, provided some training is available and sensitivity to the
process remains present. Test construction, properly done, is a
complex and often technical process. This manual can help
teachers become aware of the demands of the process without
requiring the ability to produce a test using sophisticated
psychometric techniques. It can also help in the selection of the
appropriate test for the various purposes for which teachers are
testing. A methodical procedure for selection is outlined, and will
prove helpful to teachers while inspiring confidence in the later use
of the test. This promotes efficiency in the long run, and helps
eliminate charges that too much time is spent on testing compared to
the use made of test results.

Eventual integration of test results into the teaching process
can be encouraged to the benefit of students, teachers and
those responsible for policy decisions at the district level.
Testing does not need to be abused. Testing, like a student, needs
an appropriate mix of understanding and discipline. It must be
encouraged to attend to the task at hand if anything of value is
to develop.

Plain Talk About Standardized Tests is intended to help
teachers, parents, school administrators, and others concerned
with successful learning to better understand the area of
educational testing by focusing on one particular kind of test, the
standardized test. This testing primer will help you understand
what standardized tests are, how to select appropriate
standardized tests to suit particular testing purposes, how to evaluate
tests which are required to be administered in schools, and how to
interpret and use test results.

The underlying belief in this handbook is that tests are one
of the many sources of information which may be used to improve
education, but that tests alone do not provide enough information
for decision-making. Tests are important, but they should never
be used alone.

The remaining chapters in this handbook explore the issues
in the use of standardized tests. Topics include:

• An overview of what standardized tests are, how they are
developed, and what their uses and limitations are.

• A review of what some of the important factors are in
selecting suitable tests or in evaluating existing tests,
and how to apply those factors to test selection.

• An explanation of commonly used standardized test scores,
how scores are derived, how scores are interpreted, and
what cautions should be considered in score interpretation.

• A discussion of using the results of standardized tests in
instructional planning with suggestions and procedures for
using test results to make placement decisions, to diagnose
student needs, and to evaluate instructional programs.

• A review of the basic premises which contribute to the wise
and proper use of tests, and which will help prevent their
misuse.

Various surveys and research studies, including the 1979 AFT Survey
of Teachers' Knowledge and Attitudes on Testing, have shown a wide
variation among teachers in background in testing and knowledge
about testing. Therefore, this handbook takes a comprehensive
approach including both basic definitions and concepts and more
advanced topics on testing. A more experienced reader may want
to skim certain sections of the handbook and concentrate on
selected sections. A person new to the subject will want to
take a more deliberate journey through the text.



Chapter II

STANDARDIZED TESTS

Some Introductory Comments 



Everyone has taken tests. Tests are so much a part of our
common educational experience that we often fail to look at some
of the basic and underlying principles of tests and testing. This
chapter begins with some basic concepts and ideas and proceeds
through an examination of standardized tests, including their
development and characteristics.

What is a Test?

A test is a systematic means of observing and describing
behavior. It usually consists of a presentation of a standard set
of questions to be answered. The answers to the questions are
measured against a standard and a numerical value is assigned which
is a description of the observed behavior. This "score" is
interpreted as a measure of a characteristic of the person taking the
test.

There are probably as many varieties of tests as there are
those doing the testing. Tests can range from a conscious
observation of study skills during an in-class work period to a formal
admissions examination for a specialized graduate school. Tests
may be oral or written. They may be impromptu, or involve years of
development time. Almost every act a teacher completes to assess a
student's performance could be called a test.

A test may be designed for individual or group use. It may
measure achievement, aptitude, attitudes, personality traits, or
psychomotor behavior. A test is one source of information for
educational decision-making.



Why Test? 






Tests can be used for a wide variety of reasons. Although 
some reasons are highlighted below, one could think of others
that have been excluded. Test results can be used to:

o Diagnose academic and behavior strengths and
weaknesses.

o Prescribe specific educational plans for
individuals and groups of learners.

o Place students in special accelerated or remedial
classes and programs.

o Determine student achievement.

o Evaluate program effectiveness.

o Select students possessing particular abilities
and/or aptitudes.

o Certify student competence.

o Promote teacher and system accountability.

o Predict future student behavior.

o Inform students about the development of certain
abilities.

It is important to know why one wants to test before one can
approach such issues as test design, test content, format for
reporting test results, and other more specialized questions. 

Why Do Teachers Need To Know More About Standardized Tests?

There are many reasons why knowledge of standardized tests is
becoming increasingly important.

First, tests are an integral part of the education process. 
If tests are not properly used, time and money spent on tests are 
wasted. Good testing is critical for effective learning. If you
have no way of assessing effectiveness of the teaching effort, then
it is extremely difficult to improve that effort. Standardized
tests are one kind of test almost universally used in
schools to make those kinds of assessments.

Secondly, there is growing public concern about accountability 
in education. For example, teachers in some places are evaluated 
on the basis of their students' academic performance and mastery
of established educational objectives. While this practice has
many limitations, the teacher's role and responsibility in the
education of each learner is being emphasized across the country.
Some people feel that standardized tests can be one way to approach
accountability. These tests, then, may have important consequences
for teachers, students, and the educational system as a whole.
While tests are not the only method of accountability we have, they
do help to maintain educational standards.

Third, federal and state supported programs often require audit,
monitoring, and final evaluation reports not only to determine the
effectiveness of implemented subsidized programs, but to make
financial decisions about the continuance or termination of federal and
state aid for the future. The government and the taxpayer want
to know how, where, and why our dollars are being spent in the
educational arena to educate today's children. Again,
standardized tests often have real consequences for teachers and programs
in trying to answer such questions.

Also, there are increasing legal challenges to current methods
of evaluation. Some of the more recent discussion involves the
administration of standardized tests to minority children,
particularly black and Hispanic children, the placement of
disproportionate numbers of minority students in classes for the educable
mentally retarded, the use of results from minimum competency
examinations to determine whether a high school diploma should be
granted, the perceived decline of basic skills at both the
elementary and secondary levels, and identification and assessment of
language proficiency for children of non-English backgrounds.

In various regions of the country, litigation and legislation
involving these issues have led to efforts to abolish or restrain
the use of standardized tests. Genuinely intentioned educators,
parents, and politicians are requesting a reexamination of the
role of tests and testing results in education. Teachers are, and
should be, very much involved in the consideration of these issues.

Standardized Tests: What Are They?

Standardized tests, like all tests, attempt to provide teachers
with information about their students. What makes a standardized
test different from a classroom test, for example, is that the test
items are preselected, and administration and scoring procedures
are prescribed, or standardized, for all students taking the test.
Also, they usually provide information about how others who took
the test performed. The purpose for standardization of content,
administration, scoring and interpretation becomes immediately
apparent when one considers that such tests are often used to gather
descriptive or comparative information from large groups of students.
When comparing students from more than one classroom, variability
among test conditions seriously hampers the ability to say anything
at all about the students tested. An example will make this point
even clearer.

Suppose that a school district wishes to determine how well
all of the fifth grade students in the district read. The district
consists of a number of elementary schools. One school is very
crowded and students attend class in portable trailers parked next
to the baseball field. Another is located near the airport and
instruction must stop each time a jumbo-jet lands. In a third school,
fifth graders are assigned to carpeted rooms with computer terminals
for computer assisted instruction. It may be easy to predict which
students have the best chance of performing successfully on the
reading achievement test. While educational opportunity cannot be
totally equalized in this district, the differences in testing
conditions among students can be minimized and each student's
chance of doing well on the test can be maximized by using
standardized content, procedures, and scoring rules. The subsequent
differences among students can then be ascribed to learning
environment, or to differences in achievement and ability, rather than to
differences in test climate. In order to have the kind of
information needed for instructional planning, as many factors as possible
that interfere with our ability to attribute test performance to
student learning must be eliminated. Comparability of scores and
testing conditions is an important factor.

Standardized tests, then, provide for the sampling of behavior 
under a set of uniform procedures. There is a common set of
questions that are administered with the same set of directions 
and time limitation to students, with a uniform scoring procedure. 

Standardized tests, like most other tests, are produced in four
basic steps. One source cites the following steps as central to the
construction of a standardized test:

• Planning the test. 

• Preparing the test items. 

• Experimental tryout and revision.

• Administering the standardization edition. 

The initial step in test construction is the specification of the
content and skills the test will cover. Test developers usually
consult with teachers, curriculum experts, and book publishers to
determine what is being taught and what is important. Next they
outline the topics and skills they plan to assess, as well as the
number of items that will be used for each area. The developers
prepare detailed content specifications which are usually
provided in the test manual. This provides the user with some basis
for determining how well the test content fits particular
curriculum goals.

When content has been specified, it is fixed for all forms
of the standardized test. Because most standardized tests are
designed to reach a broad market, the content and skills selected
are those covered in a large number of school districts in
geographically separate regions. For this reason,
the content of standardized tests often may not match perfectly
local curricular goals or individual classroom learning objectives.
However, they do provide a good basis for comparing students from
different schools or areas.

Test items are written to conform to the content specifications
created. Test developers produce many potential items for each
skill area, then test the items with groups of students, and finally
select the "best" items according to some prespecified criteria.
During the test development process many items are written, tested,
revised and discarded before the final form of the test is complete.

Standardized test developers give their draft test items to
groups of pupils selected to represent the population for whom the
test is intended. The purpose of this try-out is to identify and
refine the test items for final test forms and to test the adequacy 
of test directions, time limits, and format. 

From the item statistics and other information generated, test
developers select the best items and construct final forms of the
test. These final forms are tested once again with groups of
students representing the population with whom the test is to be
used. This final tryout is designed to gather information about the
quality of the test itself rather than individual items. The actual
equivalence of test forms in terms of producing comparable group
performance is determined at this stage. Reliability and validity
statistics are computed from tryout results. Group performance on
the standardization tryout is often used to construct tables for 
interpreting test scores. Also the adequacy of administration and 
scoring procedures is checked out once again. Standardization try- 
out results are written up in the test technical manual and provide 
important information for test consumers about the quality of the 
test and the limitations on score interpretation. 

While this brief description of how a standardized test might
be developed is oversimplified, it does provide an idea of the
complexity of the test development process. Central to this process
are three fundamental questions which must be kept in mind whenever
testing is under consideration. These three questions are:

o Exactly why are you giving (developing) the test? 

o What type of information do you expect from the test?

o How do you intend to use this information once you have
it? 

As you read through this manual, do not ever lose sight of these 
three questions. 



Standardized tests can be grouped into three classifications
based on what they measure. The three general categories of stand- 
ardized tests are: 

o Aptitude tests. 

o Achievement tests. 

o Personality, attitude, and interest inventories. 

Each one is distinguished by its purpose and the kind of information
it gathers.

Aptitude tests attempt to assess students' abilities and
potential. Their purpose is to predict future performance. While some
skills measured by aptitude tests may be learned, others are
developmental. Although aptitude tests are not as dependent on school
learning as achievement tests, it would be impossible to construct
an aptitude test that did not measure school learning to some extent.
Aptitude tests are used for administrative decisions such as student
selection, classification, and placement, for guidance purposes, and
sometimes for evaluation of instruction.

Achievement tests measure a student's prior learning and
developed abilities in specific content areas. Their purpose is to
assess how much a student has learned in school. Achievement tests
are used in instructional evaluation, in guidance, and for
administrative decisions such as student selection, classification, and
placement.

Personality, attitude, and interest inventories are
non-cognitive measures and are used primarily for guidance purposes.

Norm- and Criterion-Referenced Tests

Standardized tests, particularly standardized achievement tests, 
are often described as being norm-referenced or criterion-referenced.
The distinction between these two is very important, although in
practice it often becomes blurred.

The fundamental difference between these two types of tests
rests with how the scores are reported and interpreted.
Norm-referencing means reporting scores so that one can tell how a student's
score compares to the scores of others (the norm group).
Criterion-referencing means reporting scores so that one can tell how a student's
score or performance compares to some specified standard of
proficiency.



Norm-referenced tests were designed to make selection and
classification decisions where it is important to assess one
person's performance in comparison to the performance of others.
Criterion-referenced tests were designed to describe a person's 
mastery of specific content in relation to a selected standard. 

It should be obvious that norm-referenced tests, since they
are based on stated content objectives, provide information on
mastery of specific content and that criterion-referenced tests
can be used to compare student performances. Hence, the distinc- 
tion between the two is not always as clear and distinct as the 
definitions imply. 

Norm-Referenced Tests: Construction, Interpretation, and Limitation

As with all tests, norm-referenced test construction starts
with the specification of test content. Broad objectives or
content
tent areas are selected and a large number of items are written 
to assess each. Items are tried out with a group of examinees 
selected to represent the kinds of students for whom the test is 
designed. 

Data from this initial try-out is critical for developing a
test that will array student scores in a normal distribution.
Several item statistics are computed to achieve this end. One
statistic, the item difficulty index, is used to delete some items
that too many students pass or fail and to retain items that are
answered correctly by about half of the sample group. Another
statistic, the item discrimination index, is used to identify test
items which do not discriminate well enough between high and low
achievers. In order to locate problems in multiple choice items,
the number of students answering specific response alternatives may
also be inspected. Once appropriate items have been identified
through the analysis of the item statistics, test forms are constructed
and alternative forms of the test tried out again with an appropriate
sample of students.
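
The two item statistics named above can be illustrated with a short
Python sketch. It is only an illustration under stated assumptions, not
the procedure of any particular test publisher: the try-out data are
invented, difficulty is taken as the proportion of students answering an
item correctly, and discrimination is computed by the upper-group minus
lower-group method, which is one of several definitions in use. The
names item_difficulty, item_discrimination, and responses are
hypothetical.

```python
def item_difficulty(responses, item):
    """Proportion of examinees who answered the item correctly (0.0 to 1.0)."""
    return sum(row[item] for row in responses) / len(responses)


def item_discrimination(responses, item, fraction=0.5):
    """Difficulty in the top-scoring group minus difficulty in the bottom group.

    The groups are the top and bottom `fraction` of students ranked by total
    score; this upper-lower method is an assumption made for the sketch.
    """
    ranked = sorted(responses, key=sum, reverse=True)
    n = max(1, int(len(ranked) * fraction))
    upper, lower = ranked[:n], ranked[-n:]
    return item_difficulty(upper, item) - item_difficulty(lower, item)


# Invented try-out data: one row per student, one column per item (1 = correct).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
]

for item in range(4):
    p = item_difficulty(responses, item)
    d = item_discrimination(responses, item)
    print(f"item {item}: difficulty={p:.2f} discrimination={d:.2f}")
```

In this invented sample, the first item is answered correctly by every
student, so its discrimination is zero and it would be a candidate for
deletion from a norm-referenced test. The second item is answered
correctly by half of the students and separates high from low scorers
to some degree, so it would more likely be retained.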

On this second tryout, the test developer is interested
especially in norming the test and obtaining estimates of test reliability
and validity. The group performance of students participating in
this norming tryout becomes the comparison group or norm group for
interpreting test scores. For example, if three percent of the
norm sample got four or fewer items correct on a 100 item test, the
norm tables would say that a score of four would be interpreted as
falling in the 3rd percentile. Because student performance on a
norm-referenced test derives meaning principally by comparison to
the norm group, it is critical that the norming sample be
representative of the students for whom the test is designed. The ideal
norm group might represent the student population on the following
dimensions: geographic region of the country, school size,
community size, student demographic characteristics, and school
type. Because norms can become outdated, tests are usually renormed
every three to five years. The interpretation of norm-referenced
scores emerges directly from the fact that the test contains a
selection of items designed to provide a group performance which
will be arranged into a normal distribution. Test scores provide
information on how students compare to a national or local group
of students who have taken the same tests, and are expressed as
percentile, stanine, or standard scores which have been derived from
the raw scores. Because test scores tell how well a student or
group performance compares with the peer group standard,
norm-referenced tests are suitable for making decisions such as:

o Who shall be selected into a program when there
are only a limited number of spaces available?

o How do our students compare with students of
similar age and background on this skill area?
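
The way a norm table turns a raw score into a percentile can be
sketched in Python. This is a minimal illustration with an invented
norm sample. It assumes that a percentile rank is the percent of
norm-group scores at or below the raw score; test publishers may use
other conventions. The names percentile_rank and norm_sample are
hypothetical.

```python
from bisect import bisect_right


def percentile_rank(norm_scores, raw_score):
    """Percent of norm-group scores at or below raw_score (assumed convention)."""
    ordered = sorted(norm_scores)
    at_or_below = bisect_right(ordered, raw_score)
    return 100.0 * at_or_below / len(ordered)


# Invented norm sample of 100 raw scores on a 100-item test:
# three students scored 4 or lower, the rest scored higher.
norm_sample = [2, 3, 4] + [40] * 47 + [60] * 40 + [90] * 10

print(percentile_rank(norm_sample, 4))   # 3.0
print(percentile_rank(norm_sample, 60))  # 90.0
```

The same raw score would receive a different percentile rank against a
different norm sample, which is why the representativeness of the norm
group matters so much for interpretation.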

There are some limitations on the use of norm-referenced scores
that must be recognized. Because norm-referenced tests are designed
to decide how well a student's performance in a subject area
compares with other students, they may yield limited information about
the amount of specific skills a student has mastered. Subscale 
scores are sometimes not sufficiently reliable, and often do not 
include a sufficient number of items to tell you if a student has 
mastered a particular objective. 

Criterion-Referenced Tests: Construction, Interpretation, and Limitation

In criterion-referenced testing, as in norm-referenced testing,
item construction procedures focus on selection of content for the
test. Test construction begins with the specification of the
concrete objectives to be tested. Ideally, detailed content
specifications are designed to include all of the salient parameters of the
content or skill domain, such as eligible content, vocabulary,
syntax, readability, and how distractors are to be
written. Based on the objective specification, teachers should
be able to know exactly what skills and knowledge are being tested.
Based on these detailed objectives or domain descriptions,
many items are then generated to reflect each objective. As with
norm-referenced tests, these items are subsequently tried out
on appropriate samples of students. However, the purpose of the
tryout is not to "weed out" items that do not discriminate between
high and low scorers, but rather to identify items that appear to
be measuring the same objectives and eliminate items that do not.



Items selected for criterion-referenced tests should be selected
for their ability to represent the content rather than to separate
students. The ultimate criterion for item selection is the
relevance of the item to the objective being tested. Often items
that might be discarded from a norm-referenced test because they
are "too easy" or "too difficult" are retained in a
criterion-referenced test because they measure important skills.

As with norm-referenced tests, the best items are then assigned
to test forms and administered to representative groups of students.
On the basis of these results, test reliability and validity are
determined. The results of students' performance during this
final try-out may also be reported in the test manual to give
users some basis for comparing their students' scores with those
of other students, although providing comparative data is usually
not an area of emphasis. Hence, criterion-referenced tests often
become "normed."

As a result of these test construction procedures, criterion-
referenced test scores tell how much of the content a student has
mastered. Scores are often reported in terms of the number or
percentage of items correct. Scores tell how well students have
mastered the basic concepts described in the test specifications.
They usually do not tell you if this compares well or poorly with
the performance of other students in the same grade, although some
test developers do provide you with information about the
performance of a comparison group. Criterion-referenced
interpretations are especially appropriate for the following kinds
of decisions:

o Have students mastered program objectives? 

o What are individual students' strengths and
weaknesses in this area?

o Which objectives am I teaching effectively
and which not so well?
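
As an illustration of the kind of score report described above, the
following minimal sketch computes the percentage of items correct for
each objective and compares it with a mastery cutoff. The objective
names, the responses, and the 75 percent cutoff are all hypothetical
and are not taken from any published test.

```python
from collections import defaultdict

def percent_correct_by_objective(item_objectives, responses):
    """Return {objective: percent of that objective's items answered correctly}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for objective, is_correct in zip(item_objectives, responses):
        totals[objective] += 1
        correct[objective] += int(is_correct)
    return {obj: 100.0 * correct[obj] / totals[obj] for obj in totals}

# Hypothetical 10-item test covering three objectives.
item_objectives = ["fractions"] * 4 + ["decimals"] * 4 + ["ratios"] * 2
responses = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]  # 1 = correct, 0 = incorrect

MASTERY_CUTOFF = 75.0  # assumed cutoff, chosen only for illustration

report = percent_correct_by_objective(item_objectives, responses)
for objective, pct in report.items():
    status = "mastered" if pct >= MASTERY_CUTOFF else "not yet mastered"
    print(f"{objective}: {pct:.0f}% correct ({status})")
```

Note that the "ratios" objective rests on only two items, so a single
slip moves the result by 50 percentage points; this is the
items-per-objective problem that limits how definitive such reports
can be.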

Criterion-referenced tests, too, suffer from some disadvantages.
Because these are relatively new kinds of achievement measures, the
technical quality of many of these tests is questionable.
Technologies for determining test quality are still emerging.
Although these tests purport to give you specific information about
student performance, many tests do not contain a sufficient number
of items per objective to be definitive. Further, test
specifications often are not as fully described or as well-defined
as they should be. In addition, test objectives are often narrowly
focused, with the result that a single test is not equally suited
to the curricula of different schools or classes, making
interschool or program comparisons difficult.






Some Observations



The uses of norm-referenced and criterion-referenced tests
vary. Norm-referenced tests are generally used for selection
decisions, classification decisions, guidance decisions and in
aptitude, interest, and personality inventories where comparisons
are important. Criterion-referenced tests are generally used for
classification and guidance decisions also, for placement and
certification decisions, and to assess achievement of specific
content.

For instructional decisions a teacher probably will want both
to assess mastery and to discriminate among students in
achievement, so both criterion-referenced and norm-referenced tests
are used. Measurement specialist N. E. Gronlund makes the
distinction between testing the minimum essentials, employing
mastery testing using criterion-referencing, and testing for
maximum development, employing discrimination testing using
norm-referencing.

Again, many tests are amenable to both norm-referencing and
criterion-referencing, and scores are reported both ways.

Minimum Competency Tests

Minimum competency tests for students are a special type of
achievement test that has developed in recent years. Minimum
competency tests are developed to yield criterion-referenced
interpretations, since the purpose of giving a competency test is
to find out what a student knows in relation to prespecified
objectives. Test construction procedures should therefore be much
the same as those for criterion-referenced tests, although
sometimes test developers will select items based on their ability
to discriminate between those who pass the test and those who fail.
Score interpretations are also similar to those of criterion-
referenced tests, except that, by definition, scores on minimum
competency tests can be interpreted as acceptable or unacceptable
performance, that is, passing the minimum standard or not.
Developers of other criterion-referenced tests also sometimes
indicate a passing score or cut-off score indicating mastery.

Minimum competency tests share the problems and limitations of
other standardized tests. Because the consequences of test
performance may be very severe for students, such as not passing
a grade or not receiving a high school diploma, these problems are
more serious.

The task of deciding what goal or objective any test should
assess is always difficult. The task of defining objectives that
a minimally competent person should attain is complex indeed. For
example, what skills are necessary for survival? What skills are
truly basic and necessary? For whom? For what? In the absence
of empirical data these decisions represent values and opinions
that are known to differ among different individuals and groups.

The objectives for the test must bear a relationship to
what is actually taught. Courts have ruled on the need to test
what is taught and to advise students in advance of the need to
pass the test for promotion or diploma. Challenges to the testing
requirement have been filed in cases where special groups of
students, such as the handicapped, have been provided
individualized education plans that are substantially different
from the regular curriculum. If a student meets the requirements
of an individualized education plan, must the student also pass
the minimum competency test to receive a diploma? Questions such
as these will be with the competency movement for some time. The
answers may well shape the future course of education. In spite of
the rapid growth of competency tests since 1978, we are still
closer to the beginning of this issue than we are to the end.

In addition to definitional problems, minimum competency tests,
like other criterion-referenced tests, suffer from problems inherent
in any new technology. Reliability and validity are especially
important, yet it is not always clear which statistics are best
to describe these test properties.

Setting a passing or cut-off score for minimum competency
tests is another problem area. Various methods have been advanced
to deal with this problem, including the use of expert opinion, past
performance of students on the test, and statistical models
incorporating the probabilities of misclassification (i.e.,
identifying a "competent" student as incompetent and vice versa).
Although consideration of all these factors will likely yield the
most reasonable cut score, current practice often deviates from
this ideal.

In order to help ensure student competency, the results of
minimum competency tests should have implications for student
remediation. As with criterion-referenced tests in general, the
number of items assessing each objective often is not sufficient
to determine student mastery reliably, and test objectives
frequently are not defined well enough to suggest what specific
instructional remedies are necessary.

Aptitude Tests and Achievement Tests 

While achievement tests are designed to measure past learning
of school-specific content, aptitude tests are designed to predict
future achievement, or the ability to acquire new content or skills.
In theory, an aptitude test could be either norm-referenced or
criterion-referenced. However, most standardized aptitude tests
are constructed to yield norm-referenced interpretations, that is,
how a student's potential compares with others. As a result,
aptitude measures are often used to identify students with special
needs and abilities, for placing students into special programs, and
for guidance and counseling purposes.

The distinction between aptitude and achievement tests appears
clear conceptually, but in reality their functions overlap. Although
aptitude tests are apt to be less dependent on specific school
content, both kinds of tests measure learning and previous
experience, both in school and out of school. For example, while an
achievement test may measure number concepts and basic computational
operations, an aptitude test might assess problem-solving ability,
figure analogies and abstract reasoning. Depending on the curriculum
of a particular school, these latter skills may represent aptitude
or, if taught as part of the curriculum, achievement.

The distinction between aptitude tests and achievement tests
in terms of predictive value also is somewhat muddled. Although
aptitude measures purport to measure potential and predict future
success, it is clear that achievement tests also can serve this
purpose. That is, prior learning in a particular subject is often
the single best predictor of future success in that area. For
example, if a student in the second grade performed well on a
reading achievement test, it would clearly be expected that the
student would do well in the third grade reading program.



Although achievement and aptitude tests can serve similar 
functions, aptitude tests do have some advantages. For example, 
they can be used with students who have had no prior exposure to 
a subject, and they are often less time-consuming than achievement 
tests. Consequently, aptitude tests can offer an efficient method 
for screening and selecting students. However, because the content 
of these tests often is not tied to specific subject areas, the 
instructional implications of test results are limited. For example,
if a student performs poorly on a mathematics achievement test the
results might indicate that instruction and practice were needed in
geometry and ratio problems. If, however, an aptitude test shows
a student is low in non-verbal ability, what instructional actions
should be taken by a teacher?

Because the results of scholastic aptitude and mental ability
tests can have serious consequences for students, they must be
interpreted with great caution. If a student does poorly on an
aptitude test, it may mean that the student has a poor chance of
succeeding in certain school subjects. Equally likely, perhaps,
poor performance could be due to the fact that the student did not
have the environmental opportunities to develop the abilities in
question. Here is raised the issue of cultural bias, a problem that
exists in achievement tests as well. For example, one common
aptitude test item shows three sailboats in a picture at different
distances from the horizon, and asks the students to identify which
boat is furthest. Reading of picture cues is a culturally embedded
skill that is taught in school in some countries. A student
who failed this item may have done so because the student was not
taught how to judge perspective in two-dimensional space. Thus,
it often is difficult to determine exactly how much of an aptitude
test score is a function of cultural difference and prior chance
to learn and how much a function of low ability. Cultural
difference also may affect other aspects of test performance. For
example, factors such as examiner-child rapport, anxiety,
motivation, understanding of directions, etc., clearly will
influence student test scores. Consequently, many have argued that
mental aptitude tests are biased toward middle class culture, and
discriminate seriously against minority group members, students
from non-English speaking backgrounds and those from lower
socioeconomic backgrounds. Some states have banned intelligence
testing entirely, while others have mandated procedures to ensure
that tests do not receive undue emphasis in decisions about student
futures. Others have argued that those skills and knowledge
that are needed to succeed on these tests are the same ones
needed to succeed in school and society.






Chapter III 

SELECTING STANDARDIZED TESTS TO SUIT YOUR PURPOSES 
STANDARDS FOR EVALUATION 



Test selection involves a consideration of the quality of
the test instrument and the appropriateness of the test in
fulfilling local purposes. The end result of the process of test
selection is not so much a compromise of either standard, but
rather an awareness of the limitations of any test so that
accommodations can be made in the use of test results. It may
be that the information gained by testing will not provide all
that the district originally sought, but the appropriate use of
the test has limits which cannot be exceeded if the
interpretations are to remain accurate. Some judgments about tests
must be made by persons qualified in the selection and use of tests.
This chapter will identify the measures by which tests are judged
and suggest some ways of selecting tests so that local goals can
be more easily and more realistically reached. One of the most
important considerations will be how closely the tests match the
district's practices and goals. By knowing a little about the
construction of tests a person can reasonably be expected to
make an appropriate selection of a quality instrument. The key,
however, is which instruments of comparable quality will perform
the functions that allow the most economical and effective use
of student and teacher time and effort.

Test selection is a little bit like sculpting an elephant
out of a large granite rock. With hammer and chisel you knock
away everything that does not look like an elephant until the
sculpture is complete. In selecting a test you will eliminate
those tests with features inappropriate for your use or which do
not add anything to the accomplishment of your goals. You will
eliminate tests which do not measure up to the standards of
test construction. In the end you must choose between tests
which to some extent meet your criteria. It is judgment by
professionals at this point which will produce something of
value, something of continued usefulness, and something which
will allow districts to proceed with the assessment of students
to meet the identified needs.

Before one decides to test it would be useful to think
about what one is attempting to accomplish. Measuring achievement
or aptitude is not as easy as some might think. Imagine
for a moment that you have come at the close of a late summer's
evening to a small lake set in the woods. The half moon provides
just enough light to see. An evening breeze blows a mist about
the lake, obscuring but not completely blocking the shape of
objects. You decide to recreate this scene and relate it to
others. How would an artist begin to capture this scene? With
limited tools and skills it is difficult to portray the changing
shape of mist, the shrouded trees, and to add a sense of the other
stimuli present such as the coolness of the moist night air
against your skin.

The measurement of aptitude and achievement is just as
elusive. We must proceed with measurement tools which are
imperfect and observation skills which are limited. The artist
would struggle to see in the mist the shapes that provide
dimension and then try with brush or pen to portray what he saw.
When observing the results of the artist, the viewer attempts to
supply many of the intangibles of the scene based on his own
past experience. In testing we observe the product of the
existing tools of measurement and struggle with a desire to
relate our observations to our own experience.

There is a demand for some proof of the existence of learning.
Some have real or imagined uses for the outcomes which develop
as a result of teaching and learning. It requires a skilled
observer to assist in identifying the detail which would be missed
by most persons. In this respect there is a role for teachers
in testing and for critics in art. Both are sensitive to the
medium and the subject and are perhaps the best resource to
describe what occurs in either process. Such comments, along with
reflection based on our accumulation of experience, allow us to
infer certain things about what we could expect in future
experiences of this kind. Supportive and reminiscent sketches and
interpretations of life known as tests help people think about
their concerns. This access to events occurring in another place
and time is at the heart of measurement, the reason that art
exists.

If the decision is made that tests are to be used, then the
first task is to establish the purpose of the testing program.
Secondly, one must search for a useable testing instrument. The
usefulness of a test will be determined by how consistently it
provides the information sought and the degree to which the test
is capable of achieving stated goals. These are the primary
considerations in test selection.

Technical Properties of a Test 

The properties of standardized tests which a school district
might first consider in the selection of a test would include
the reliability and the validity of the test.
Estimates of reliability and validity are provided in information
about the test supplied by test publishers and independent
reviewers. Test publishers complying with the Standards for
Psychological and Educational Testing will describe the results
of several measures of reliability and validity along with other
descriptive data in test manuals. Oscar Buros's Mental
Measurements Yearbook lists and reviews a large number of currently
published tests. Other sources of information are available and
will be detailed later in this chapter. These sources may be
useful in making comparisons about the qualities and properties
of tests as they relate to each other and to the goals of the
district testing program. This section discusses how these
comparisons are developed and suggests some cautions to be
considered.

Typically a district would use standardized achievement
tests to make generalizations about a person's performance on
measures of developed abilities. From personality inventories
and aptitude tests mentioned in the previous chapter, other
inferences are made. Any measure of behavioral or educational
characteristics of a student (the non-physical measurements)
is subject to some error. It is the nature of psychological
and educational testing that the process is less exact than
physical measurement. When using test results, teachers,
counselors, and administrators need an awareness of the possibility
for error. Test results are useful in drawing inferences and
generalizations when used in combination with other factors to
develop judgments. It is not correct to abandon the use of tests
simply because they do not allow us to weigh in scales the
thoughts of man. The weight of a gold ball should remain
constant through several weighings. Unlike a gold ball, the
educational and psychological characteristics of students are in
flux. Our efforts to measure those characteristics will be
affected by the changes that take place within each individual.
Even if we could assess the same group repeatedly with the same
instrument, which is not a likely occurrence, the
results of the test would vary according to factors largely
independent of the test instrument.

PART A

RELIABILITY 

Reliability is used in testing to indicate how consistently
a test can measure performance over time and when administered
in somewhat different conditions. When the reliability of a
test is high, individuals will retain their relative rank when
scores are compared to others in repeated administrations of
the test. A test must provide consistent results even considering
that students will often take the test in various locations
and with different test administrators. These differences will
produce slight variations in instructions, timing, and other
environmental factors. To be useful to a district, a test must
be tolerant of such variations and produce useable results in
spite of the differences. Some of the variables affecting
the consistency of tests are trait instability, sampling error,
administration error, and scoring error.






Trait Instability 

Trait instability is another way of saying how much
educational characteristics will vary over time. It is one of the
variables in testing that is independent of the test. What a
student knows and forgets about a body of knowledge or domain
will change with the experiences he has. The reaction to the
different questions used in a test will vary.

Sampling Error 

Sampling error is the term used to describe how the choice
of questions affects the reliability of the test. Since test
questions will produce scores from which inferences about the
knowledge of a specific student or groups of students are made,
some provision for identifying this factor must be included.

Administrative Errors

Errors in the administration of a test may be made. While
such error should be limited, the complete elimination of error
in this area is not likely. The test must be flexible enough to
account for a range of individual styles and conditions in test
administration.

Scoring Error 

Scoring error is a mistake on tests when a student knows
the correct answer, but marks the answer incorrectly. A less
than accurate picture of how well the student knows the
information being tested is presented. Such errors should not be
overlooked as to their effect on test reliability.

Error Variance

In a reliable test, finally, the personal qualities that
the student brings into the testing session should not drastically
affect the outcome. How a person feels physically, how interested
he is in taking the test, even how lucky he is in guessing
the correct answer will all affect the reliability of the test.
In the construction of a test, the designer must not only take
into account these non-test variables, or error variances, but
must also explain in the test information the result of
considering these error variances and how reliable the test is
after taking into account these possibilities.

Reporting Reliability 

Many kinds of variables will affect test scores. It is
reasonable to expect that test publishers will provide
information they feel is important about the effect of variables on
the reliability of their test. Since reliability can be
established in a number of ways that will be different for different
kinds of tests, the district's purposes, and not those of the
test publisher, should be the criteria against which reliability
is judged.

For example, if you were looking for a competency test you
would likely select a test that has used equivalent forms to
establish reliability. You want to have a test that is very
reliable around the passing point as opposed to the high or low
end of the scores. Equivalent forms of the test tend to provide
a superior measure of the test's consistency around the passing
point or central point of the scoring range. In a test that is
to measure mastery of a skill or domain, it might be more useful
to determine very accurately the scores at the top end of the
range. The various methods of establishing test reliability
are presented so that in selecting a test the method that best
fits the district's needs can be examined.

Measures of Reliability 

Reliability information, as reported in reviews of tests,
will frequently be given in quantitative terms derived through
statistical analysis of the results of test administrations. The
result of this analysis, commonly reported as a correlation
coefficient, will show how closely the results compared agree with
each other. The correlation coefficient is a measure of
relationship. What the test reviews will hope to show is that a
reliable test will produce test scores that have a high correlation
coefficient. The correlation coefficient is expressed as a value
ranging from +1.00 to -1.00. To indicate performance on the
test which is identical for the same person or groups of persons,
a +1.00 correlation would be provided. To show completely opposite
performance, a correlation of -1.00 would be assigned. To show
that there was no relationship at all, a correlation of 0.00 would
result. For a reliable test instrument, a strong positive
correlation is desirable. As a general rule, a correlation of +0.85
would be acceptable when considering a test for which group
inferences will be made. The specific needs of the district will
govern what correlation is acceptable, but a correlation close to
zero will indicate a test which will be less than useful.
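
To make the correlation coefficient concrete, the sketch below
computes it directly from two lists of scores. It uses the standard
Pearson product-moment formula, which the handbook does not name but
which is the usual statistic behind reported reliability
coefficients. The function name and the student scores are
hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2 scores)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when all scores in a list are equal")
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for five students on two administrations of one test.
first = [12, 15, 19, 22, 30]
second = [13, 14, 20, 21, 29]

same_order = pearson_r(first, second)            # students keep their rank: near +1.00
reversed_order = pearson_r(first, second[::-1])  # ranks reversed: strongly negative
print(round(same_order, 2), round(reversed_order, 2))
```

When students keep nearly the same rank on both administrations the
coefficient approaches +1.00; when the ranking is reversed it becomes
strongly negative.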

Tests of Reliability

To assist with test-to-test comparison, publishers provide
much of the data one needs in assessing reliability. The
reliability indices reported in the test manual or by reviewers
of tests are likely to be determined by one of the following
methods: test-retest, equivalent forms, split-half, or
Kuder-Richardson. They measure equivalence or stability in the test
and would be reported as appropriate to the test.

Test-Retest Method

Test-retest reliability is determined by administering a
test to a group of persons, and then re-administering the same
test to the same group at a later time. The scores for both
tests for the group are correlated. When reporting reliability
in this manner, the test manual should indicate how long an
interval of time separated the two administrations, because the
interval will affect the way a person is likely to answer a
particular question. If the interval is too short, the learning
that occurs in the first test administration is likely to be
carried over to the second test. The student may mark an answer
incorrectly on the second test the same way as on the first test
because of memory effect. If the test is to measure developed
abilities, this kind of carry-over will give results that do not
reflect the students' true ability. Tests in the psychological
domain are not so severely affected by this kind of error.

In addition to the score variance that occurs as the same
person retakes the same test over time, other possibilities for
error variance present themselves. Different questions on
separate forms of the test might be more difficult for some
students. To correct for this error, measures of equivalence have
been developed in an attempt to identify such potential problems.
Equivalent forms of the same test which are equal in content and
statistical properties are administered to the same group of
students. Sometimes such forms are described as being parallel
to emphasize that the content is to be similar between the two
forms. An example of this would be to weigh the same object on
two different scales on the same day. Assuming the object does
not physically change between the two weighings, we would expect
the differences in weight to be due to the difference in the
instrument used for measurement, or to measurement errors.

Equivalent Forms Method 

In parallel or equivalent form measures of reliability the
student will take both tests and the scores will be compared. If
there is a high correlation we assume the test is reliable. In
selecting the two forms to be used, consideration should be
given to similarity of content, mean scores, and variances for
each. If the tests follow closely in succession, it is likely
that the differences in scores will be due to the differences
in the forms. The range of difficulty of items, format, time
for administration, and examples must all be carefully
considered in the construction of the tests to be compared. Because
of the difficulty in creating such equivalent forms, test
designers have sought other methods to establish reliability.

Split Half Method 

When it is impractical to test the same group on more than
one occasion, or if alternate forms of the test are not available,
items making up the test scores can be examined to determine
reliability. Student performance on individual items on the test
is related to the total test score. Split-half reliability is
one such method of establishing the internal consistency and the
homogeneity of test items. The items are separated into two
halves, the sub-scores determined, and then correlated. The
correlation between the sub-scores is a measure of how reliable a
test only half as long as the original might be.
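
The split-half procedure can be sketched as follows. The odd-even
split and the Spearman-Brown step-up used to estimate full-length
reliability are standard techniques that this handbook does not name;
they are shown here as one common way of carrying out the method, not
as the only one. All function names and item scores are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def split_half(item_scores):
    """item_scores holds one list of 0/1 item scores per student.

    Returns (half-test correlation, Spearman-Brown full-length estimate).
    """
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    r_full = 2 * r_half / (1 + r_half)  # step up from half length to full length
    return r_half, r_full

# Hypothetical results: six students, six items scored right (1) or wrong (0).
item_scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
r_half, r_full = split_half(item_scores)
print(f"half-test r = {r_half:.2f}, full-length estimate = {r_full:.2f}")
```

The stepped-up figure is higher than the raw half-test correlation
because the full test contains twice as many items as either half.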

Kuder-Richardson Method

To avoid splitting the test to check reliability, the
Kuder-Richardson formulas can be used when the items are scored
either right or wrong, or in some "all or none" type tests. The
Kuder-Richardson formulas, K-R 20 and K-R 21, are used in situations
where items on the test are assumed to have either the same level
of difficulty or different levels of difficulty. K-R 21 is the
formula which assumes constant difficulty levels for all items.
It is more useful for classroom teachers to know because the
computation is direct and easy. The teacher must only compute the
mean and variance for the test and substitute the values
determined and the number of test items into the formula.
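
The handbook does not print the formulas, so the sketch below uses
their standard textbook forms. K-R 21 needs only the number of items,
the mean, and the variance of the total scores; K-R 20 also needs the
proportion of students answering each item correctly. The use of the
population variance, the function names, and the data are assumptions
made for illustration.

```python
from statistics import mean, pvariance

def kr21(total_scores, n_items):
    """K-R 21: assumes all items are equally difficult."""
    k = n_items
    m = mean(total_scores)
    var = pvariance(total_scores)  # population variance of total scores
    if var == 0:
        raise ValueError("all total scores are equal; reliability is undefined")
    return (k / (k - 1)) * (1 - m * (k - m) / (k * var))

def kr20(item_scores):
    """K-R 20: uses each item's own difficulty (proportion correct)."""
    n_students = len(item_scores)
    k = len(item_scores[0])
    var = pvariance([sum(row) for row in item_scores])
    if var == 0:
        raise ValueError("all total scores are equal; reliability is undefined")
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n_students
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var)

# Hypothetical results: six students, six items scored right (1) or wrong (0).
item_scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
totals = [sum(row) for row in item_scores]
r21 = kr21(totals, n_items=6)
r20 = kr20(item_scores)
print(f"K-R 21 = {r21:.2f}, K-R 20 = {r20:.2f}")
```

When item difficulties differ, as they do in this made-up data, K-R 21
comes out lower than K-R 20; the two agree only when every item is
equally difficult.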



The K-R formulas produce correlation coefficients comparable
to procedures used earlier. Because of the nature of the K-R
formulas, there will be generally lower correlations from tests
that measure widely varying skills and content than from tests
which require the same kinds of skills and cover similar content.
For example, a test which consists entirely of vocabulary words
will likely produce more consistent results internally than a test
that includes vocabulary, spelling, and math computations. The
first kind of test is said to be homogeneous and the second is
said to be heterogeneous. Individual variations are going to be
greater on heterogeneous tests because all skills of an
individual do not develop at the same rate and to the same extent.
Even though reliable assumptions can be made from test performance
on tests that cover a single skill, the purpose for testing in your
school may be to assess a variety of skills and aptitudes.
In such cases a lower, but acceptable, correlation measure may
be the best choice.



The purpose for testing as well as what the test is supposed
to measure must be considered in examining the measure of
reliability and item consistency. If you wish to determine how
heterogeneous the test is, the difference between the correlation
for the split-half reliability and the Kuder-Richardson
correlation would be an indication. Kuder-Richardson formulas will
show a lower correlation for widely varying skills. Split-half
measures of reliability would likely produce a higher correlation
on such tests. Knowing the difference can help to identify the
heterogeneity of the test under consideration.



Comparing Reliability Methods







Other Factors Affecting Reliability

In a longer test, a split-half reliability measure will produce 
a higher reliability coefficient or correlation than when 
the separate parts are considered. This will continue to hold 
true as the test is lengthened by the addition of equivalent 
items. It will work in the reverse as well. When equivalent 
items are taken out of the test, the correlation coefficient will 
decline. Practical considerations of how much time can be
allotted to testing will determine the most appropriate test
length for individual school purposes. It may be that a highly
reliable test which is too long will be passed over for a
shorter, but less reliable, alternative.
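
The effect of test length on reliability is usually estimated with
the Spearman-Brown formula. The manual does not name that formula, so
the sketch below is an assumption about the calculation behind the
paragraph above, and the starting reliability of 0.80 is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is made `length_factor` times
    as long by adding (or removing) equivalent items."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

original = 0.80  # hypothetical reliability of the test as published
for factor in (0.5, 1, 2):
    predicted = spearman_brown(original, factor)
    print(f"{factor} x length -> predicted reliability {predicted:.2f}")
```

Halving the test lowers the predicted coefficient and doubling it
raises it, which is the trade-off between testing time and
reliability that a selection committee has to weigh.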

The speed quality of the test will also affect the
reliability. As the internal consistency of a test is examined, some
differences in item difficulty will be found. A test which
emphasizes speed over power (knowledge brought to the testing
situation) will be designed in such a way as to allow most
students to answer all of the questions. In a speed test, the
difficulty level of items is quite low and the items are very
consistent. Because of this the reliability coefficient will
appear to be quite high when items of the test are compared to
one another. If a test is selected for its speed qualities
alone, a reliability measure such as test-retest or alternate
forms would tell you more about the test's ability to accomplish
its goals than one which simply examined halves of the test
or looked to other internal consistency measures.

Another factor affecting reliability measures is the
homogeneity of the group. If you have a group of students that is
quite similar to the group on which the measures were established,
then in a reliable test the results can be expected to be
consistent. For example, suppose an achievement test under
consideration is to be administered to a group of seventh graders.
One would expect individual seventh graders' scores on the test
to vary more randomly than if the test were given to students
of markedly different ability levels such as might be encountered
in a grade span of fourth through ninth grades. Assuming the
items to be consistently difficult, the fourth graders should
score lower on the various forms of the test and the more
advanced students higher. When testing a group similar in age
and abilities, such as students in a single grade, the
probability of consistent scoring differences occurring drops. The
characteristics of the norming group can be compared to the
groups to be tested to determine if the reliability of the test
can be expected to remain consistent.

Similar to group homogeneity, item homogeneity will affect
the reliability of the scores. The difficulty of the items used
will determine high or low reliability based on the number of
persons who correctly answer the question. Consistency of scoring
will affect the measure of reliability if everyone answers the
majority of items because they are easy, or everyone misses the
majority of answers because they are more difficult. The test
examiner is hampered in determining reliability if student
ranking is not apparent from the test scores. This is because
reliability, by the definition we are using, calls for the
same individual to retain the same rank in relationship to
other scores through repeated administrations of the test.

These variables are taken into account in the development of
the test. The results of the review of the test show the degree
of consistency of results over repeated administrations of the test.
The selection of the test most appropriate to your purpose can be
made using this information. Once sufficient reliability is
established, other measures of appropriateness of the test can be
considered.



PART B 
VALIDITY 



A test may be reliable and can provide consistent results,
but will be of little use if it does not measure what we are
interested in measuring. If it purports to measure what we hope
to learn about, but is inaccurate, then it is still not useful.
No test can be said to be absolutely reliable or valid in the
abstract. It is quite possible to have a test that is highly
valid for a particular purpose, but invalid for others. Validity
is described as the degree to which a test measures what it
intends to measure. There are four kinds of validity measures.
They are content validity, predictive validity, concurrent
validity, and construct validity. There is
also a measure or consideration known as face validity, which is
technically not a validity measure, but is related. Predictive
validity and concurrent validity are often grouped together and
called criterion-related validity.

Content Validity 

Content validity is the extent to which a test measures
the subject matter content and the behavioral changes under
consideration.



Content validity is of particular concern in achievement
testing. A test is carefully analyzed to determine the subject
matter content covered and the responses test takers are
expected to make, compared to the domain of achievement to be
measured.

To judge content validity, first the content domain must
be defined. This involves consideration of both the subject
matter and the type of behavior or task to be measured. Both
the content and the process are important. Content domains and
behaviors which are tested by the instrument should be identified.
It is the responsibility of the group involved in the selection
of the test to compare the domains and behaviors tested with
their own goals to determine the best available match. Because
of the practical limitations on test length, only samples of
these domains and behaviors will be included in a test. Test
specification tables and grids which identify the objectives
of the test and the content and behavior which will be tested
can be used to decide if a sufficiently large and representative
sample has been included. How many subdivisions are included
for each major category will vary. Because a test for a
district will be given to students from a large number of teachers,
the detail of the content and behavior to be considered will
be different for groups of students. It would be helpful to
the classroom teacher to see the widest possible consideration
of specifications so that the practice for the greatest number
of classrooms can be matched to the specifications of the test
instrument. Then, after the inspection and comparison,
judgments can be made about content validity.

In addition to seeking the individualizations which occur
from classroom to classroom, it is important to remind those
involved in the selection process of the need for some continuing
standard against which the test can be compared. The text for
the classroom, or the supplementary materials provided by the
district, will give some clues as to what the actual content
taught in the courses might be. Dr. Andrew Porter of the
Center for Research on Teaching, Michigan State University, has
been working to develop methods of identifying the relationship
of the major reading textbook series to the most commonly used
achievement tests. By using a matrix of items from chapter
exercises and the test items, comparisons are derived. Such
kinds of studies may help to more clearly describe what items
from the texts are sampled by the tests. The process appears
to have application to other non-textbook materials and
curriculum inclusions as well. If successful, a major obstacle
would be eliminated and test content could be more accurately
matched to the classroom instruction.

Examining for Content Validity 


Four possible threats to content validity should be taken
into account when examining a test. What is the extent of the
mismatch between program objectives and the objectives of the
test? While it would be unusual to find a test that matches
perfectly with the program goals of the district, the test
should address itself to district goals as closely as possible
if program evaluation is the objective.

Does the instrument really test the skills that it is
intended to? Is it much broader or narrower in scope than it claims to
be? Suppose writing skills are to be tested. Upon review of
a test you judge the skill tested most frequently to be that of
proofreading. The test would be more valid for assessing
editing and correcting skills than for assessing skill in
sentence construction, style, etc.

Is the vocabulary or the format of the test familiar to
students or does the test rely on a specific set of curriculum
materials? In some instances tests will be developed to match
specific materials sets, and would include vocabulary that would
be unfamiliar to students who have not used those materials.
Content and format of the test may be equally dependent on
materials used, so a thorough examination of the entire test is
in order.

Are there enough items for each objective to be tested
accurately? In some cases a test may be generally valid for all
other purposes of the district, but may include too few items
relating to some subskills. If fewer than five to eight items
are included, it may be difficult to make some statement about
a student's skills in that area. For inferences about groups of
students, a few fewer might give some indication of how the group
might do, but the emphasis of the test should ideally match the
emphasis of the district's program as often as possible.

Predictive and Concurrent Validity

The next two forms of validity are grouped under the heading
criterion-related validity. When using some criteria to
validate the test, it is possible to collect the data at the same
time, or concurrently. This procedure is used to determine if
the test provides the same information as some other measure, and
the extent of agreement between the two. Predictive validity
is generally determined by collecting the data at two different
times. It is often used to see how well the test can predict
future performance. Concurrent validity determines to what
extent the information from one test can be substituted for another.
Predictive validity seeks to establish if performance on a test
can predict some characteristic.

One problem with the use of criterion measures for comparison
is the lack of agreement that will surround the various possible
choices. For example, not everyone will agree that grades at
the end of a course are fair criteria. They might argue that
such a criterion could be affected by the subjective judgment
of the teacher. Suppose teachers had access to previous grades
or test scores. An inclination would be present to grade
according to how well they thought a student would probably do, and
not on how well the student actually performed. Others might
argue just as persuasively for the inclusion of grades and cite
the relationship between grades and future success in school. If
there is a disagreement, then persons determined to be expert in
the field might be consulted as to what criteria would be
consistent enough to use.




Construct Validity 

The kinds of validity measures presented so far have been
all related to some specific practical use of tests. Another
type of validity has application to test interpretation in
relationship to some psychological theory. Construct validity
attempts to explain some psychological quality which we feel
exists in order to explain some aspect of behavior. In
establishing if a test has construct validity, one first needs
to determine what constructs might account for certain behavior
on a test. Some assumptions or hypotheses can be developed for
that construct. Finally one would seek to verify the assumptions
by logic and by empirical procedures. In the process both the
test and theory are validated.

One assumption about intelligence is that it will increase
with age. Another is that test scores on some standardized tests
will differ with certain groups such as the educationally
handicapped and the educationally gifted. Other characteristics of
intelligence might be identified as well. The results of tests
given to the different groups can be examined in light of the
assumed characteristics of intelligence. If the results match
or have some high degree of correlation, then the test might be
considered valid. If they do not match, the test might be
considered invalid and the theory correct. Another assumption
would be that the underlying theory of the psychological
construct is wrong. Whichever one chooses to believe, the
independent judgment of the teacher and of groups of teachers must
be applied to the question, "Precisely what does this test
measure?" A review of the total evidence available about the
construct considered should give some clue as to the usefulness
of this measure of validity.

Face Validity

Finally the test should appear to be valid to the casual
observer, to the student who takes the test, and to school personnel
who participate in the administration and use of the test and
its results. While face validity is not technically validity
at all, but rather a judgment about how the test might be
considered, it is nevertheless important. If parents feel the test
is irrelevant, then they might not seriously consider the
outcomes. Students taking the test might not perform with
enthusiasm and concern if they feel somehow the test is not a serious
effort on the part of the district to gain information. The
appearance or face value of the test is considered, and the
test selectors must decide if by outward appearances persons
will seriously consider the test and the results obtained by
using it.




PART C
TEST NORMS

Test norms are provided by the test publisher to allow
those involved in selecting tests to determine if the group
upon which the test was normed or standardized is similar to
the group of students in their district. Not every instructor's
manual for a test will describe in detail the specifics of the
norming population. This unfortunate occurrence has been allowed
to become practice in too many instances, and accounts for a great
deal of the confusion over lower than anticipated test scores.
What if, for example, a test was normed on a sample that was
geographically removed from the district considering the test?
Certain speech patterns and idiosyncrasies can be introduced into
the vocabulary of the test which would be advantageous to
students from the norming area. You need go no further than the
definition for a flavored, usually colored, carbonated beverage
popular with both children and adults. Students from the east
coast of the United States would certainly mark "soda" as the
correct answer. The boys and girls from Kansas City would be
surprised if the correct answer were anything but "pop."

The influence of television advertising on children might
affect the answers given by different age groups. Early primary
children would possibly believe that "S-E-I-T-Z" spells baloney.
After all, they were told that in certain advertisements by
Seitz, a manufacturer of processed meats. Older test takers
might be inclined to spell "relief" "ROLAIDS," especially after
completing the test section on mathematical computation.

Some general rules to keep in mind when considering the
norming population in regard to the district population would
include sufficient size, diversity, age, geographic location, and
other characteristics to draw comparisons to the district which
will be giving the test. Such information is available from
test publishers, who do not provide local norms but can offer
assistance in helping the district to determine them. The
district can have its test scores reported in such a way as to
review the entire district and make some assumptions about how
the local population can be expected to perform over time. It
is a process of matching local and national characteristics.
More importantly, a judgment must be made as to what the effect
of a mismatch will be when matching national information to the
local population.

Publishers will usually indicate at what time of year the
test was normed. Some will provide information for more than
one time; for example, the Metropolitan Test reports spring
and fall norms so that the differences which occur over the
school year will be accounted for in the comparisons. Tests
that have been recently normed will be the most useful. The
content of the test can be examined. If the content is severely
dated, further investigation of the norming date should be made.
The copyright date on a test is not always an indication of the
norming date, so specific information should be obtained from
the publisher. What constitutes an adequate size is relative
to your needs, but a good rule is to look for a test that has
been normed on a large sample taken from various locations. A
test given to a large number of students at a very few locations
is not as likely to be as representative as one which was given
to a group of comparable size at a variety of locations. The
use of a few schools might severely skew the data and make
comparison impossible.

Norms will tell how persons as a group performed on the
test, and not how they should have performed on the test. They
will allow comparisons to be made at the district level about
performance, but will not indicate what the level of performance
should be. They will not indicate advancement without looking
at previous scores. It might be more useful for the teacher to
know a child improved two grade levels than to know that the
child is scoring at the average for all students nationally.

The choice of norms, either national, special group, or
local, for reporting and comparison of district scores must be
weighed carefully against the goals of the district. National
norms, however large and diverse, will likely have been
collected through tests administered at schools. Considering that
different age groups attend or drop out of school differently,
one realizes the possibility for excluding some people exists.
If the national norm was established for eleventh graders and
the national dropout rate was ten percent, then the dropout
rate in the district using the test might be a consideration if
it exceeds or lags behind the average. Special group or fixed
reference norms might provide a consistent standard if they fit
the district's particular needs. Local norms might be fine for
decisions about the school program from location to location,
but might not satisfy the demand of legislators who seek
information about how well the students in the state or district are
performing compared to the national average.

PART D 
STANDARDS FOR EDUCATION 

It seems appropriate to make a point at this time about
standards. One of the advantages of standardized tests is that
their use over the years has produced results that are
generally recognized by the public and many educators as acceptable
standards. If there are misgivings about the standards
demonstrated by test scores, it is a matter of degree and certainly
not an absolute denial of their worth. Experts in educational
measurement can point to significant improvements in tests and
test use. These changes came about for a variety of reasons.
Some changes have resulted in fewer deficiencies in tests, but
most of the attention lately has been centered around the use
of test results. This is where the greatest potential for
abuse lies.






Repeatedly in this manual it is stated that the selection
and use of tests must fit local goals and objectives. The
inclusion of teachers and school officials in the selection of
tests that best fit their goals and curriculum is stressed.
This remains true and vital if tests are to be properly used
and of value to a district program. If, at the same time, after
extensive examination of available tests there are apparently
none which are reasonably close to the content of the curriculum
taught or seemingly none of the available instruments will fit
the needs of the testing program, then perhaps an examination
of the district curriculum might be in order. A great many
innovations in curriculum and programs have developed. Not all
are appropriate for the needs of students who must compete in
a real world upon graduation. While it is not likely that any
test will exactly match the specifics of each district, it is
likely that for the test to be marketed profitably and
economically offered, the test must have broad appeal. Suppose
the district needs and goals are so divergent from other
districts that none of the available tests can provide enough
information about your program to be useful. It is possible that
students are being prepared for a society vastly different from
the one in which they will have to participate. This kind of
injustice far exceeds any abuse that an inadequate test or the
improper use of test results might generate. Consensus of many
as to the needs of the group is often the vehicle which offers
the best possibility for compromise. Students will be the
beneficiaries if teachers are widely involved in the decisions
affecting curriculum offerings, testing programs, and the overall
goals of the system. Fears about teacher groups abusing
curriculum goals are unfounded. Natural limits on the power of
any single constituency exist in free collective bargaining.
Teachers should pursue bargaining beyond economic issues and
press for participation in the discussions relating to
educational matters. Through this forum the observations of the
classroom instructor can be presented.

PART E 
PUTTING IT ALL TOGETHER 

The selection of the test or tests that best serve
purposes determined by the district is dependent upon a wide range
of variables. Systematic procedures for consideration of these
variables will certainly help the process, especially if
established at the local level to meet the requirements and the
resources of the local district. For this reason only a
concept is offered as a basis for proceeding. No detailed
formula or recipe offered here is going to contribute much to
the establishment of a specific plan. Education about tests
and involvement of the widest range of opinion and participants
in the selection of the test offers the best hope to accomplish
the task. The rough elements of test selection should include
a group of individuals who are knowledgeable
about the advantages and limitations of tests. If the
knowledge is insufficient, then education must precede action.
Careful review of the district's goals in education, and the
part that testing plays, would be a logical step to follow. An
examination of the available tests can be made, but needs to
be done considering the actual curriculum covered. A sequence
emerges from this kind of reasoning, and sequence is probably
the key to test selection. Education precedes the collection
of information. If enough information has been gathered to
make a decision, then proceed to the next step. If the
knowledge about the application of the information seems too limited,
then logic calls for more information or education. Teachers
should not feel as if they are being rushed to judgment by
either district officials or by others with an interest in the
sale or acquisition of tests. Reputable test publishers can
be expected to provide assistance and reference to assist in
the selection of the instrument. Those who are reluctant to be
of help must receive the message that without the support and
assistance necessary to make a decision with which teachers will
be comfortable, there will be no selection.

The technical considerations of tests are very important and
should carry much weight in any final decision. There are some
practical considerations as well. This final section discusses
cost, format, time requirements, and similar concerns that will
have an impact on the way tests are used and the extent to which
they will prevent or encourage teachers to use test results in
decisions about children.

The cost of tests must be considered in relationship to the
cost of educating a student. Tests may range from as little as
$.50 per student per administration and scoring, up to and
beyond $1.35. When one adds the cost for sufficient manuals,
individual tests, scoring and other associated costs, such as the
proverbial #2 pencil, it may seem like a substantial sum in
order to get the information sought. While it is generally
wise to be conscious of the total cost for test administration,
much more money is spent each year on the education of students.
If every student had a test each school year that cost around four
dollars, less than three tenths of one percent of his education
cost would have been spent for testing. Information from this
expenditure can be gained about the student that is used to make
decisions about the most useful and productive way to allocate the
other 99.7 percent of the resources available. The costs of
tests are listed in several publications and the publishers
will be able to provide up-to-date quotes on the current prices.
Buros' Mental Measurements Yearbook lists the costs for tests,
specimen sets, technical manuals, scoring charges, and incidental
costs for tests listed. A comparison of costs in books which
have been published for some time should give you an idea of
how the various test costs range, but the most current prices
from the publishers along with anticipated future cost increases
would be a more accurate estimate as the selection process moves
toward a decision.






In the appendix is a list of some of the more commonly used
tests along with the publishers. For a minimal cost, sample
tests and technical manuals should be available. This kind of
collection in the district's test resource library should be
accessible to teachers. The district teacher center would be an
ideal choice for such a library.

In addition, a service offered by the Educational Testing
Service of Princeton, New Jersey, called simply the "Test
Collection," contains an extensive library of tests and other
measurement devices. It was established as an archive for
testing and has current test information and related services
available to persons engaged in education, research, and
advisory activities. Over 10,000 tests are kept there in addition
to files on American and foreign test publishers. Scoring
services and systems, state testing programs, published test
reviews, and reference materials on measurement and evaluation
information are also available. These tests and materials are
available to teachers and to districts and could serve to
educate staff as well as expedite the test selection process.
The staff of the Test Collection are available to answer phone and
mail inquiries. Access to the Test Collection resources is also
possible on-site to qualified persons who have an interest in
testing.

Current information on testing can be obtained through a
publication by the Test Collection called News on Tests.
Announcements of new tests by publishers or non-commercial sources,
citations of test reviews, and new reference materials of
interest to those involved in testing are included in the annual
ten-issue publication. Tests which are not commercially
available, but cited in educational and psychological literature,
are available on microfiche from the collection in individual
copies or sets of a hundred. The list of Major U.S. Publishers
of Standardized Tests is also available in pamphlet form.
Annotated test bibliographies in specific subject areas have
been prepared and are available on request.

Also located at the ETS headquarters is the ERIC
Clearinghouse on Tests, Measurement, and Evaluation. Annotated
bibliographies are available from the ERIC Document Reproduction
Service (EDRS), Computer Microfilm Corporation, P.O. Box 190,
Arlington, Virginia 22210. These bibliographies would serve
to provide the latest information on publications relating to
the specific topic of testing that is important to the district.
Some are included in the appendix to this manual.

In reviewing tests for selection it would be useful to
speculate on the potential use of the test and what can be done
to get maximum information from each test available. The length
of time necessary to administer the test will play a role in
how often a test might be used. If the time required to
administer, score, and get the results back from tests seems excessive
to school personnel, the tests may not be used as planned.
Likewise, if the teachers who are administering the test feel
excessive time is spent in the testing process that could be
used more successfully in other educational activity, they might
resist using the test. Some consideration for these attitudes
must be made prior to the acquisition of the test. If the
specifications of the test exceed what you determine will be
tolerated by staff and students you should reconsider your
choice. An alternative test might be adopted or the advantages
that can be gained by using the particular test must be
convincingly explained.

Test scoring is another practical consideration that must
not be overlooked in selection. A variety of scoring services
are available. Some simply provide a self-scoring guide for
teachers. Others have elaborate computer-scoring and data
comparisons. Choose the one that meets the needs of groups and
individuals in the district who intend to use test results in
making decisions. Test publishers should be willing to
describe the various appropriate applications of the scoring
services and specify the costs associated with each.
Generally, a summary of the scores for pupils at school and district
levels can assist in making some observations about the level
of achievement in the district as well as establishing some
expectations for performance of the various groups. The kinds
of scores generally available include raw scores, national and
local percentile scores, national and local stanine scores,
standard scores, and grade or age equivalent scores. In some
cases, such as criterion-referenced tests, a percent correct
score might be provided. The reporting options are available
for school, classroom or the individual needs. Some might be
appropriate for parents or others concerned with the education
process. The important thing to remember is that not all score
reports are equally comprehensible and usable. Knowledge of
the format for presentation and the information provided can
help to promote wider use of the test for the purposes intended.

As previously stated, a systematic, thoughtful examination
of the available options is going to produce for each district
different procedures, but useful results. It would be of some
value for the participants involved and for future test selection
committees to have notes and summaries of the selection activities.
Locally developed procedures such as these could be reviewed by
experts and their advice sought as to possible improvements.

The end result of the effort should be the selection of an
acceptable test that will provide an estimate. Tests are not
different from assessment in general, as all assessment provides
estimates of the measure considered. Considering the problems
that would occur without standards for comparison, tests
properly constructed and used are far and away better than the
absence of any standards.






Chapter IV
INTERPRETING TEST RESULTS 



Once a test has been administered and some kind of score obtained,
an interpretation of that score needs to be made. It is
difficult to place a test score in its proper perspective without
some standard or basis for comparison.



Consider the following scores received by a student on an
achievement test and its sub-tests:

Language   Reading         Mathematical   Mathematical   Composite
Arts       Comprehension   Concepts       Computations   Score

52         46              61             48             53

These scores, by themselves, tell us nothing about the student's
performance. We have no idea how many questions the student answered
correctly out of the total possible. We do not know how much mastery
over the subjects this student has. We do not know how this student's
performance compares to that of other students. We do not know how
this student did on one sub-test relative to another because we do
not know what test scales are used, or what the scores mean.

We must have some system for reporting test scores that will
provide us with useful information. 

This chapter discusses raw scores and various kinds of derived
scores which are used to report test results. The intent of this
chapter is not to provide a technical discussion of such scores,
but to offer an overview of the concepts behind the various scores
to help the reader in score interpretation.

Raw Scores and Derived Scores

The raw score is simply the number of test items a student
answers correctly on the sub-test or test as a whole. For example,
the student who correctly answered 37 of 88 items would have a raw
score of 37.

Raw scores alone have little meaning. For example, what does
it mean that a student achieved a raw score of 37 on a test? In
order to interpret this score, you need to compare it to some standard.
Derived scores, scores that are derived from the raw score, provide
this comparative information. They tell you what a student's raw
score means in relation to the scores of other students, or what it
means in relation to a student's accomplishment of test content.
These two basic comparisons represent two divergent perspectives in
testing that were described in Chapter II.

From a criterion-referenced perspective, derived scores would
show to what extent a student has mastered a specific area of
content. A percentage correct score would indicate what proportion
of the content domain the student has mastered. Using the example
above, the student would have a percentage correct score of 42
percent:

37/88 = 0.4205; 0.4205 x 100 = 42.05, or 42 percent

Also, we might want to have a minimally acceptable score, or cut-off
score, to indicate the minimal mastery level that would be accepted.
Using the test above, a raw score of 50 might be the minimally
acceptable score. The student attaining a 37 would fall below that
score and would not have demonstrated the sought-after level of
mastery.
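
The percentage correct and cut-off comparison can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and treating a score exactly at the cut-off as acceptable is an assumption, since the text says only that a score below the cut-off does not show mastery.

```python
def percent_correct(raw_score, total_items):
    """Return the percentage of test items answered correctly."""
    if total_items <= 0:
        raise ValueError("total_items must be positive")
    return 100.0 * raw_score / total_items


def meets_cutoff(raw_score, cutoff_score):
    """Return True when the raw score is at or above the cut-off score.

    Counting a score equal to the cut-off as acceptable is an assumption.
    """
    return raw_score >= cutoff_score


# The student from the text: 37 of 88 items correct, cut-off score of 50.
print(round(percent_correct(37, 88)))  # percentage correct, rounded
print(meets_cutoff(37, 50))            # whether minimal mastery was shown
```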

There are also techniques for comparing a test score of one
student with others in a single group of test scores or with a larger
group who have previously taken the test. Examples of such scores
are percentiles, standard scores, stanines, normal curve equivalents,
and grade level equivalents. These will be discussed below.

The Normal Distribution

Basic to the discussion of derived scores for norm-referenced
tests is the concept of the normal distribution. A normal
distribution is a distribution which is perfectly symmetrical about its
mean and has a bell shape. Scores are concentrated around the mean
with fewer scores at either extreme. The general form of the normal
distribution is shown below:




The distribution of test scores in the norm group is assumed to 
conform to the normal distribution and is assumed to be typical of 
the population to be represented by the norm group. 






The distribution of test scores within the norm group is simply
a summary of how students scored, that is, how many students achieved
each possible score. An example is shown below. A distribution is
computed for each grade level or age level, and becomes the basis
for calculating all subsequent derived scores. Typically, the
distribution shows that most students score at or near the mean, or
the average score for the group, and few scores are extreme, either
very high or very low.

Sample Distribution of Standard Scores

[Figure: frequency curve. Vertical axis: Number of Students.
Horizontal axis: Total Test Score, with the mean marked at 40.]

Because the performance of the norm group is basic to the
interpretation of test scores, one should examine critically this group's
validity. Several questions you might ask include:

o Do the students participating in the norming truly
represent the intended population? Were students
similar to those being tested included?

o Was there a sufficient number of students included
in the norming to warrant generalization?

o Is the performance information relatively current?

o Was the norming conducted at the same time of year
that the students took the test?

The answers to these questions can be found in the examiner's
manual for the test. Test publishers report national norms. If it
is found that the answers to any of the above questions are negative,
one should be skeptical about using the norm to interpret the test
results.






One additional warning in interpreting test scores needs to be
made. Because the scores derived from different tests are based on
the performance of different norm groups, the derived scores are
not directly comparable. For example, a score at the 50th percentile
on the California Test of Basic Skills is not exactly the same as
scoring at the 50th percentile on the California Achievement Test.
Perhaps the norm group for the former test was composed of higher
achievers than the latter, or vice versa. In addition, the two tests
may measure different skills.

Percentile Scores 

One way to determine how a student's performance compares with
the norm group is to derive a percentile score. A student's
percentile score indicates the percentage of the norm group whose raw
scores fell below the student's raw score. For example, performing
at the 60th percentile means that a student's raw score was higher
than 60 percent of the students in the norm group.

If a student earns a raw score of 82 items correct out of 100 
total items on a science test, this would be equivalent to the 98th 
percentile if 98 percent of the students who took the test received 
scores below an 82. 

A percentile score shows how a student ranks with respect to
the performance of the norm group. A percentile score ranges from
one to 99 and is derived from the score frequency. That is, the
number of students in the norm group achieving each possible score
is computed. Percentiles are not difficult to compute. The
percentile score corresponding to each raw score is calculated as
the sum of the percentage of students scoring below the raw score
plus one half the percentage of students scoring the same raw score.
An example is shown in Table 1 (see following page).

Percentiles have disadvantages. The distortion of percentile
scores around the mean can be a serious problem. All test
scores are estimates, and can easily vary a point or two. Test
publishers, in fact, often report an index called the standard
error. This index takes into account chance errors and offers one
basis for determining the range within which a student's true score
probably falls. Using this statistic, a student's true score can
be interpreted as within the interval bounded by the score the
student received on the test plus and minus one standard error.
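
As a brief illustration of the interval just described, the sketch below returns the band of one standard error on either side of an observed score. The function name and the sample values (a score of 37 and a standard error of 3) are hypothetical; the actual standard error comes from the test publisher's manual.

```python
def score_band(observed_score, standard_error):
    """Return (low, high): the observed score minus and plus one standard error."""
    if standard_error < 0:
        raise ValueError("standard_error cannot be negative")
    return (observed_score - standard_error, observed_score + standard_error)


# Hypothetical values: observed score of 37, standard error of 3.
low, high = score_band(37, 3)
print(low, high)
```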

If the distribution is normal, a percentile difference will
not represent the same amount as an equivalent raw score difference.
For example, the raw score difference between the 50th and 59th
percentiles is not as great as between the 90th and 99th percentiles.
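
To see why equal percentile differences are unequal score differences, one can express each percentile as a distance from the mean in standard deviation units. The sketch below assumes a perfectly normal distribution and uses Python's standard statistics module; the function name is hypothetical.

```python
from statistics import NormalDist


def gap_in_standard_deviations(lower_percentile, upper_percentile):
    """Distance between two percentiles of a normal distribution,
    measured in standard deviation units."""
    unit_normal = NormalDist()  # mean 0, standard deviation 1
    upper = unit_normal.inv_cdf(upper_percentile / 100)
    lower = unit_normal.inv_cdf(lower_percentile / 100)
    return upper - lower


near_mean = gap_in_standard_deviations(50, 59)
in_tail = gap_in_standard_deviations(90, 99)
print(near_mean < in_tail)  # the nine-point gap in the tail is wider
```

Under this assumption the nine percentile points near the mean span well under half a standard deviation, while the nine points at the upper extreme span about one full standard deviation.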






Table 1

Sample Raw Score and Percentile Score Equivalents

Total Raw   Number of   Cumulative   Cumulative       Percentile
Score       Students    Frequency    Percentage (1)   Score (2)
            Achieving
            Score

 1            1            1            1%               1
 2            2            3            3%               2
 3            4            7            7%               5
 4           10           17           17%              12
 5           24           41           41%              29
 6           22           63           63%              52
 7           16           79           79%              71
 8           10           89           89%              84
 9            4           93           93%              91
10            2           95           95%              94
11            2           97           97%              96
12            2           99           99%              98
13            0           99           99%              98
14            0           99           99%              98
15            1          100          100%              99
16            0          100          100%              99

(1) Cumulative Percentage = (Cumulative Frequency / Total Number
    of Students) x 100

(2) Percentile Score = Cumulative Percentage Below Score + 1/2
    Percentage Achieving Score
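
The percentile rule in the second footnote to Table 1 can be sketched as follows, using the student counts from the table. The function name is hypothetical, and rounding halves upward and limiting results to the range 1 to 99 are assumptions; a publisher's table may round differently.

```python
import math

# Number of students achieving each raw score, taken from Table 1.
TABLE_1_COUNTS = {1: 1, 2: 2, 3: 4, 4: 10, 5: 24, 6: 22, 7: 16, 8: 10,
                  9: 4, 10: 2, 11: 2, 12: 2, 13: 0, 14: 0, 15: 1, 16: 0}


def percentile_scores(counts):
    """Map each raw score to a percentile score.

    Percentile = percentage scoring below the raw score
                 + one half the percentage achieving the raw score.
    Rounding half up and clamping to 1..99 are assumptions.
    """
    total = sum(counts.values())
    scores = {}
    students_below = 0
    for raw_score in sorted(counts):
        achieving = counts[raw_score]
        pct_below = 100.0 * students_below / total
        pct_achieving = 100.0 * achieving / total
        value = pct_below + pct_achieving / 2.0
        scores[raw_score] = min(99, max(1, math.floor(value + 0.5)))
        students_below += achieving
    return scores


percentiles = percentile_scores(TABLE_1_COUNTS)
print(percentiles[5], percentiles[8])  # percentile scores for raw scores 5 and 8
```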






Percentile scores can be misleading because they do not tell
you how much a student knows, but rather how the student's
performance compares with that of others. For example, a student could
score at a high percentile by answering correctly only 50 percent of
the items. While one might say that the student is doing well relative
to the norm group, one could not say that the student had mastered
the test content.

Standard Scores



Another method of determining a student's relative standing with
respect to the norm group is to derive the student's standard scores.
There are several types of standard scores, among them the z-score
and t-score described below. Standard scores show how far a student's
raw score is above or below the mean of the norm group. This distance
is expressed in terms of standard deviations. The standard deviation
is a measure of how spread out the scores within a group are, and
basically is an average of how much the scores in the group deviate
from the mean.

What does it mean to say that a student's raw score is some number of
standard deviations above or below the mean? The interpretation
is made more understandable by reference to a normal distribution
of scores. One of the most useful properties of a normal
distribution is that when it is divided into equal intervals, a
fixed percentage of scores falls within each interval. For
example, when a normal distribution is divided into intervals one
standard deviation wide, 34 percent of the raw scores fall within
the interval from zero to one standard deviations from the mean,
14 percent of the raw scores fall within the interval from one to
two standard deviations from the mean, and two percent of the raw
scores fall within the interval from two to three deviations from
the mean. These percentages apply to standard deviation intervals
both above and below the mean (see Figure 2).



Figure 2
Normal Distribution

[Figure: normal curve. Horizontal axis: Test Scores.]



To the extent that the distribution of raw score frequencies
approaches a normal distribution, the number of standard deviations
a student's raw score deviates above or below the mean can be given
percentile score meaning. In a normal distribution, the average
score corresponds to the 50th percentile. A student scoring one
standard deviation above the mean would have a percentile score of
84, while a student scoring two standard deviations above the mean
would have a percentile score of 98. On the other side of the mean,
a student whose raw score was one standard deviation below the
mean would have a percentile score of 16.
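
The correspondence between standard deviations and percentile scores can be checked with a short sketch. It assumes a perfectly normal distribution, uses Python's standard statistics module, and limits the result to 1 through 99 as an assumption; the function name is hypothetical.

```python
from statistics import NormalDist


def percentile_from_deviations(deviations_from_mean):
    """Percentile score for a raw score lying the given number of
    standard deviations above (+) or below (-) the mean of a
    normal distribution. Clamping to 1..99 is an assumption."""
    fraction_below = NormalDist().cdf(deviations_from_mean)
    return min(99, max(1, round(100 * fraction_below)))


for deviations in (-1, 0, 1, 2):
    print(deviations, percentile_from_deviations(deviations))
```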

The distribution of scores within the norm group is not always
normal. The number of standard deviations above or below the mean,
therefore, does not always directly translate into percentile
scores. The correct standard score and percentile score equivalents
should, however, be provided in your test manual. In any case, as
the above discussion suggests, the number of standard deviations
above or below the mean can furnish a yardstick to determine how
unusual a student's raw score is. For example, if a student scored
within one standard deviation of the mean, the score would not be
very unusual, while if a student scored more than two standard
deviations from the mean, then the performance would be quite
extreme, either extremely good, if above, or extremely poor, if below
the mean. With this overview of the interpretation of deviation, we
turn to the definition of two commonly used standard scores: z-scores
and t-scores.

z-scores. z-scores tell how many standard deviations a student's
raw score is from the group mean; they usually range in value from
-3 to +3. A student's z-score is computed by subtracting the mean
raw score for the norm group from the student's raw score and then
dividing the difference by the standard deviation of the norm group.
For example, suppose a student scores 58 correct on a test. The norm
mean is 56 correct and the standard deviation for the norm group is
four. The student's z-score would be:

(58 - 56) / 4, or +0.5.

This z-score indicates that the student's raw score is 0.5 standard
deviations above the mean of the norm group. A student who has a
raw score of 46 on the same test would have a z-score of:

(46 - 56) / 4, or -2.5.

This score indicates that the student's raw score is 2.5 standard
deviations below the norm group mean.

t-scores. A student's t-score is derived from the student's
z-score. It is the result of multiplying the student's z-score
by 10, and adding 50. For example, a student who has a z-score
of 0.5 on a test has a t-score of 10 (0.5) + 50, or 55; while a
student with a z-score of -2.5 on the test has a t-score of 10
(-2.5) + 50, or 25. Here, then, in contrast to the z-score scale
where the mean is zero and one standard deviation is equal to
one, the mean of the t-score is 50, and one standard deviation
unit is 10.
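
The two computations can be sketched together, using the figures from the examples above (norm mean of 56, standard deviation of 4). The function names are hypothetical.

```python
def z_score(raw_score, norm_mean, norm_sd):
    """Number of standard deviations the raw score lies from the norm mean."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return (raw_score - norm_mean) / norm_sd


def t_score(z):
    """Convert a z-score to a t-score (mean 50, standard deviation 10)."""
    return 10 * z + 50


# The two students from the text.
for raw in (58, 46):
    z = z_score(raw, norm_mean=56, norm_sd=4)
    print(raw, z, t_score(z))
```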

t-scores indirectly indicate the number of standard
deviations a raw score deviates above or below the mean. A t-score
of 70 indicates that the student's raw score is 2.0 standard
deviations above the mean; a t-score of 35 indicates that the
student's raw score is 1.5 standard deviations below the mean of
the norm group raw scores. Given a t-score, one may determine
the number of standard deviations the student's raw score
deviates above or below the mean by deriving a standard score which
indicates this directly, that is, by deriving the student's
z-score. Since a t-score is equal to 10 times a z-score plus 50,
a z-score is equal to a t-score minus 50, divided by 10. So,
given a student's t-score of 70, the student's z-score is
(70 - 50) / 10, or +2.0. And, given a student's t-score of 35, the
student's z-score is (35 - 50) / 10, or -1.5.
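
The reverse conversion just described is a one-line calculation; the function name below is hypothetical.

```python
def z_from_t(t):
    """Recover the z-score from a t-score: subtract 50, then divide by 10."""
    return (t - 50) / 10


print(z_from_t(70))  # t-score of 70
print(z_from_t(35))  # t-score of 35
```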

The advantage of t-scores over z-scores is that they avoid
negative numbers and decimals, which makes calculations easier.
It should be noted that many other standard scores exist. For
example, the Scholastic Aptitude Test uses a standard score
scale where the mean is 500 and one standard deviation equals
100. Normal values for the scale, thus, range from 200 to 800.

Stanines 

Stanines are derived scores with a mean of 5 and range from
1 to 9. They divide a normal distribution into nine parts.

Stanine scores, like standard scores, indicate how far a
student's raw score deviates from the norm group mean. The
distribution of raw scores in the norm group is divided into nine
intervals. The inner seven intervals (stanines 2-8) are one-half
standard deviation wide, and the outer two intervals
(stanines 1 and 9) are greater than one standard deviation
(see Figure 3). Stanine 5 straddles the mean and contains all
raw scores within 0.25 standard deviations on either side of the
mean.

The remaining stanines are evenly distributed above and
below stanine 5. Stanines 6 and 4 contain, respectively, all
raw scores 0.25 to 0.75 standard deviations above and below the
mean, while stanine 1 contains all raw scores more than 1.75
standard deviations below the mean. When a normal distribution is
divided into stanines, each stanine contains a fixed percentage
of raw scores: 20 percent of the raw scores fall within stanine
5; 17 percent of the raw scores fall within stanine 6 and 17
percent of them fall within stanine 4; 12 percent of the raw scores
fall within each of the stanines 7 and 3; 7 percent of the raw
scores fall within stanine 8 and 7 percent of them fall within
stanine 2; and, finally, 4 percent of the raw scores fall within
each of stanines 1 and 9.
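
The stanine boundaries can be turned into a simple lookup from a z-score. This is a sketch only: the function name is hypothetical, and assigning a score that falls exactly on a boundary to the higher stanine is an assumption the text does not settle.

```python
from bisect import bisect_right

# Boundaries between adjacent stanines, in standard deviations from the mean.
STANINE_BOUNDARIES = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]


def stanine_from_z(z):
    """Return the stanine (1 to 9) containing the given z-score.

    A z-score exactly on a boundary goes to the higher stanine (assumption).
    """
    return 1 + bisect_right(STANINE_BOUNDARIES, z)


for z in (-2.5, -0.5, 0.0, 0.5, 2.0):
    print(z, stanine_from_z(z))
```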



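Figure 3
Normal Distribution
Stanine Scores

[Figure: normal curve divided into nine stanines. Vertical axis:
Frequency of Students. The stanine boundaries fall at -1.75, -1.25,
-.75, -.25, +.25, +.75, +1.25, and +1.75 S.D. from the mean, and the
stanines are numbered 1 through 9 from left to right.]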



Many have recommended the use of stanine scores rather than 
percentile scores. Because stanines cover a range of percentile 
scores, they tend to be more stable estimates. 

Normal Curve Equivalents 

Normal curve equivalents are a relatively new derived score
and have been used in ESEA Title I evaluations. Like percentiles,
normal curve equivalents range from 1 to 99, with a mean of 50.
Normal curve equivalents, however, have a standard deviation of
21.06 so that normal curve equivalents of 1 and 99 correspond to
the 1st percentile and 99th percentile, respectively. One
disadvantage of normal curve equivalents is that they can be easily
confused with percentiles.
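
A short sketch shows how the standard deviation of 21.06 makes the two scales agree at 1, 50, and 99. It assumes a normal distribution and treats a normal curve equivalent as a standard score with mean 50 and standard deviation 21.06; the function name is hypothetical.

```python
from statistics import NormalDist

NCE_MEAN = 50
NCE_SD = 21.06


def nce_from_percentile(percentile):
    """Normal curve equivalent for a percentile score (1 to 99),
    assuming a normal distribution."""
    if not 0 < percentile < 100:
        raise ValueError("percentile must be between 0 and 100")
    z = NormalDist().inv_cdf(percentile / 100)
    return NCE_MEAN + NCE_SD * z


for percentile in (1, 25, 50, 75, 99):
    print(percentile, round(nce_from_percentile(percentile)))
```

Between those matching points the two scales differ, which is one reason they are easily confused.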






Grade and Age Level Equivalent Scores 

Grade- and age-level equivalent scores attempt to tell where
a student's raw score falls with respect to the average performance
of students at various grade or age levels. The average raw score,
the median of students at each grade or age level, then defines the
grade (or age) level equivalent for the test. Grade or age level
equivalents for grades that were not tested are computed by
interpolating based on the trends in the data.

If a student's score on a test was the same as the median
score for all beginning second graders, then the student's grade
equivalent score would be 2.0. If the student scored the same as
the median score for all beginning third graders, a grade
equivalent score of 3.0 would be assigned. As mentioned above,
interpolation would be used to assign grade equivalent scores in between.
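
One way interpolation between tested grades could work is sketched below. Straight-line interpolation, the function name, and the median raw scores (20 for beginning second graders, 30 for beginning third graders) are all assumptions for illustration; a publisher's actual procedure may differ.

```python
def grade_equivalent(raw_score, grade_medians):
    """Interpolate a grade equivalent from (grade, median raw score) pairs.

    Straight-line interpolation between tested grades is an assumption.
    Returns None when the raw score lies outside the tested range, since
    extrapolation is not attempted here.
    """
    points = sorted(grade_medians)
    for (low_grade, low_median), (high_grade, high_median) in zip(points, points[1:]):
        if low_median <= raw_score <= high_median:
            if high_median == low_median:
                return low_grade
            fraction = (raw_score - low_median) / (high_median - low_median)
            return low_grade + fraction * (high_grade - low_grade)
    return None


# Hypothetical medians for beginning second and third graders.
medians = [(2.0, 20), (3.0, 30)]
print(grade_equivalent(27, medians))  # a score between the two medians
print(grade_equivalent(35, medians))  # above the tested range
```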

Although grade- and age-level equivalent scores have great
intuitive appeal, they suffer from a number of methodological
problems. A primary problem is the way scores are interpolated,
that is, how scores are derived for levels not tested, and how
scores between tested levels are computed. Using the example
above, what does a grade equivalent score of 2.7 mean? Or, what
does a grade equivalent score of 4.6 mean if the test was not
given to students above the third grade? Interpolation and
extrapolation are imprecise.

Also, small sampling errors can be compounded into large
errors in extrapolation and then make grade equivalent scores
very misleading and inaccurate.

In addition to methodological problems, age and grade
equivalent scores are often misused. If a fourth grade student obtains
a grade-level equivalent score of 7.0 in mathematics, it does not
mean that the student can do what a seventh grader does. It only
means that the student got the same raw score on the test as the
average seventh grader participating in the norming or that the
score was obtained through extrapolation. A grade-level
equivalent score says nothing about the content a student knows. The
items the fourth grader answers correctly to obtain a seventh-
grade equivalent score may be quite different from the items the
seventh grader answers to obtain the same score. The test given
the fourth grader most likely does not include many of the skills
or content that would be expected of a seventh grader.

Another common misuse of grade- or age-equivalent scores is
to use them as standards and assume that all students should be
performing at least at their own grade level or age level. Given
the way these scores are calculated, one can expect half the
students at any age or grade to fall below the chronological
equivalent.






Because of the methodological problems of grade and age
equivalent scores and their frequent misinterpretation, many
prominent measurement experts have suggested that these types of
scores should not be used.



Chapter V 

APPLICATION OF STANDARDIZED TESTS TO YOUR PURPOSES



In the previous chapters, information on the various kinds of
tests was presented, factors which should be considered in the
selection of tests were discussed, and in the last chapter the
various results of tests were presented. In this chapter, the
application of the test scores to classroom and district use is
presented. That interpretation is based on the question, "How
are the results of the tests to be applied to the improvement of
instruction?" In seeking to answer this question other questions
may develop. What does a district release to legislators,
citizens, and parents about the performance of students on tests? The
answers to these and other questions about what information is
necessary to release or use may not be provided expressly by the
test results. To the uninitiated it might seem shocking that fully
half of the students in the district scored below the fiftieth
percentile on a test. If the parents of the district would be concerned
about such information, then someone in the district must make
an effort to educate the parents or others drawing similar
conclusions. This is certainly one of the responsibilities that
accompanies testing. Persons associated with and concerned with
the testing of students will want to know the results. Before
addressing their concerns, first see to it that the teachers have
the information necessary to digest the results and apply them
to instruction. Teachers are the front-line contact with students
and parents, and as such have priority for test information.

An additional responsibility exists: to correctly assess the
resources and time necessary to evaluate test results and incorporate
them into the education process. Some person in the district
must be charged with the responsibility for accomplishing this
task. Those involved need to know what is available, and what
is expected of each participant in the testing program. Decisions
about testing should be carefully weighed in light of the
information developed in such an analysis. The reactions of various
interested groups should be considered so that a consensus is
reached or, lacking a consensus, a policy is adopted with authority
to implement.

Careful monitoring of the testing program should be provided 
for in the implementation of the policy. Users and those seeking 
information regarding performance can abuse the results of the 
best, most carefully adopted test. 

There is a demand for testing, but a considerable effort is
required to properly implement the program. It is a responsibility
not to be taken lightly.




Educating the Test User 



With this charge for responsibility in mind, interpretation of
test scores can begin. How much effort will be expended to get
the test results to the people who will be using them? Someone
with a realistic idea of how much time will be required for teachers
and administrators to process the results of the test should make
such estimates in advance of the test selection. Then factors
affecting the processing should be explored. In the best teaching
form, it should be gone over and over again until it is right.
While the school year is in progress is a good time to assess the
level of understanding held by the staff. Those who would benefit
by courses or inservice work on test use and application should be
given the opportunity to improve their skills. Information about
the test selected and its advantages and properties should be
available at the school, the union office, teacher centers,
wherever three or four are gathered, so that the widest access possible
is available.

There is absolutely nothing wrong with offering incentives
to teachers to learn about the test they will be using. Money
is nice. So is released time, credit toward continuing education
hours, etc. If as much money were put into getting people to
properly use tests as is spent on the tests themselves, it would
still be less than one percent of the per pupil cost. In these
times of inflation this is an exceptional bargain. Remember that
the decision about how enthusiastically people will be willing to
work on the testing program or any other part of the school
program needs to be a joint decision involving teachers,
administrators, and board members. It is about time we tried a way that
responsible people could agree might have a chance for success.

Once the amount of effort and resources that will be put
into the testing program is decided, some other things have to
be reviewed. What was it that the district decided was their
purpose for testing in the first place? In conjunction with
this some restatement of the limits of the test selected should
be made. The format for presentation of the data, either hand
scored or computer generated, should be identified. When and
where will the data be available and to whom? The answers to
these questions can be presented in fairly simple and direct
terms. Plain talk is the key to getting people to listen.
Some group in the district, perhaps the persons who served on
the selection committee, should prepare all of the information
that is necessary for presentation of the test to the staff of the
district. Someone in authority should see to it that provision is
made for the staff to receive the information. The following are
possible considerations that might be made in the implementation
of a testing program.

The chief officer of the district responsible for testing
should have some procedure available to get the information to
teachers. This would not preclude an agreement with the union
to have meetings to offer inservice during the school day or at
some other time agreeable to the staff. Participation of teacher
representatives on the test selection committee as well as a
survey of needs during the selection process will produce the
kind of enthusiastic cooperation necessary to make the program a
success. The preparation of the information which will be presented
should include statements of the district's purpose for testing.
If the district sought to have achievement tests so that
generalizations about developed abilities could be made, then it should be
presented in that way. The basic information available to teachers
should include at least a manual for the test and comments by the
committee about its appropriateness for use in the district.
The district may want to indicate that the test was
checked for reliability and validity. Statements about how well
the content of the test matched the curriculum and supplementary
materials used on level in a majority of the classes in the
district would be appropriate. Specifics could be included in
reference tables without bogging down the reader. It would also
serve as a check for teachers as to what was being taught in the
various classrooms of the district. The basic skills and content
covered by the test could be compared with the curriculum used
in the district. What was tested that was
normally taught and vice versa would be information that could
help relieve some of the expectation of teachers for test scores.
It could also serve to point out the areas that were usually
taught that were felt by the district to be important enough
that they needed emphasis. Such a summary comparison does not
relieve teachers from the responsibility of examining test items
to determine if there is content validity for their students.
If lack of content validity is noted, some procedure for notifying
the district should be specified.

Some discussion of the norming group and the characteristics
of the district could be included. Comments about the significant
differences would serve to point out what reasonable expectations
should be made about students' performance on the test. Special
groups of students or student populations might be identified so
that teachers of those students will not be caught unawares by
the group's performance. For example, if students with limited
English abilities are predominant in some schools or classrooms,
then knowing the advantages and limitations of the test for a
particular group will allow teachers to adjust their expectations
as necessary. This will help with the morale problem that
accompanies unexpectedly low results.


Teachers should know the format for the score reporting.
They should also know if others in the district will be reviewing
individual or group data about the students so that duplication
of effort can be avoided and proper attention to the scores can
be given. Teachers should not find out from the headlines in
their local paper that the class they are teaching is far below
the district or national average. Such information should be
first presented to and evaluated by the teacher as suggested
previously so that suggestions for improvement can be included
in any reference to district scores. At no time is it appropriate
to compare the scores of one school directly with the scores of
another school. Such comparisons must be made in relationship to some
standard that has meaning. The differences for each school over
time must be considered. The scores are best presented in
relationship to the facts of the situation.

The same goes for comparisons of student scores with the
particular teacher or class. By their nature, the standardized
achievement tests must test knowledge learned in previous grades
or years as well as the current one. The notion that arbitrary
goals of improvement can be set without considering where the
student began is ridiculous and wrong. Test scores are not
intended for such purposes. Appropriate use of the scores might
include a review with the teacher of the strengths and weaknesses
of the individual student as shown by the test scores and other
measures such as grades, or by personal observation. Allowing
for error and related factors, the generalizations that can be
made should serve to improve and focus the instruction where
possible. In reviewing the test scores, students may score high
on general reading, vocabulary, listening skills and auditory
portions of the test. If for the same students markedly
different performance (two or more stanine scores difference) on
the math or science is observed, it might be attributed to some
problem other than comprehension of the vocabulary on the test
or lack of reading skills necessary to complete those sections
correctly. Teachers should be encouraged to look to the test
items and determine what kinds of skills and content were missed
in class. Math computation might have been presented in a
vertical format but on the test only horizontal format problems
were included. This unfamiliarity with the format might be
disguising developed abilities. This kind of item review would
produce some estimate of the source of the problems. By
observing the problems the teacher should be able to indicate the
resources necessary to overcome the deficiencies. The
teacher's recommendations should be supported and acted
upon. If such recommendations are not met, the district must
reconcile itself to the fact that limited change is possible.

It is important to remember that the tests are inexact and
that some things besides student performance or test characteristics
may affect scores. If abnormally low scores are discovered in a
particular class of students, some check on the administration
of the test would be in order: a review of the conditions of the
class, the instructions given, the physical conditions of the
testing room, and similar non-test or student factors should be
considered. Unusual attendance, activity, illness, or acts of
nature could account for the differences. Even though tests are
supposed to be able to accommodate some variations from the
normal situation they may not tolerate extreme conditions.

Once teachers have the information in hand about the test,
it would be useful to have someone who is not a direct supervisor
or evaluator available to answer questions. The press of normal
school business is such that counselors, principals, and consultants
are not always available to spend the necessary time to
fully discuss the questions raised. Some teachers may fear that
they will appear less than competent if they raise a particular
question. If such questions remain unanswered, serious errors
in interpretation of test scores may occur. Designate someone
who might be available after school hours or during the planning
periods of the school day for some period of time prior to returning
the test scores. Teacher centers or the union office
could also provide information to those with questions. These
kinds of additional resources should not be overlooked in the
presentation of test information or the clarification of questions.

The directions about the use of test scores should include 
some information about how the scores will be arrayed. The 
available methods of computer printing and analysis can eliminate 
much of the fatiguing busy work previously performed by the 
classroom teacher. Where available, this option should be chosen.
Samples of the scores which will be presented along with the
forms for analyzing the data should be part of the presentation
of information. The class record, a summary of all student
scores and norm information, is a valuable tool for teachers to
use. Item review to identify those skills or content areas most
often missed should be included as well. The manual for some
tests will provide sample worksheets and forms which can be altered
to fit the district's needs. Uniform preparation of these
kinds of analysis sheets is imperative and teachers should be
told clearly what is needed in the way of cooperation on this
matter. While the test manual may be helpful, the addition of
the local interest in a successful testing program cannot be
ignored. Teachers must have a manual and a test to refer to so
that they can begin to examine the items for form and content.
Teachers need to be familiar with all aspects of the test if they
are to gain confidence in the test and begin to use it to supplement
other criteria in forming educational decisions.

When scores are returned to the district, the distribution 
of the scores and the accessibility to them by teachers is im- 
portant to allow for. Careful planning as to the location of the 
scores and any security necessary to protect students from un- 
authorized use of their scores must be thought through. Time 
to assimilate and prepare the scores in a meaningful way must be 
designated and provided. When the scores are distributed to the 
school are they tucked away in the counselor's office or can 
teachers get them from another source? Teachers should know when
scores are available and how to get access to them. Too often in 
the past the test scores have been returned too late to be of 
value to the teacher who has tested the pupil. This is essen- 
tially true in spring testing situations. In fall testing, the 
scores often come back too late to use or incorporate into 
planning for individual student programs. The district should 
clearly state precisely what it hopes to have the teacher use the 
scores for and what timetable is to be followed.






Available Test Scores and Their Properties 



In the previous chapter, the kinds of scores reported were
introduced. Scores are available in several forms based on the
raw score performance. Scores may be expressed as raw scores
(the number of items answered correctly), percentile ranks,
stanines, grade equivalents, and scaled scores. Some test publishers
may use still other means to report the scores. In any
case the raw scores are the one thing that tests have in common.
How these raw scores can be transformed into other comparable
units of measure is something that should be specified by the
district and should match the intent of the publisher.

Percentile Ranking 

If percentile ranks are to be used, teachers should understand
that the percentage of cases in a distribution at or below
any given score value determines the percentile rank of
that score. Since percentile scores do not provide equal units
of raw score measure it would be helpful to remind teachers
that near the center of the distribution the scores will bunch
up. The difference of a few raw score points may make a large
difference in percentile ranks at the middle of the scale and a
small difference near the ends of the scale. Figure 5.3 provides
an example and illustrates the effect of this feature of
percentile scores. Averaging percentile ranks
is difficult because the units are not of equal value, and
interpreting the magnitude of a difference
between percentile scores is difficult for the same reason. This is not to say
percentile ranks are useless, but this characteristic should be
noted and accounted for by the teacher. The publisher should
identify percentile ranks for the various times of the year in
which tests may be given. A raw score for a fall administration
will yield a different rank for the spring administration, so
the correct table should be identified and used by the teachers.
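
The definition of a percentile rank given above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the ten raw scores are hypothetical, and a published test would use the publisher's norm tables rather than a single class as the distribution.

```python
def percentile_rank(score, all_scores):
    """Percent of cases in the distribution at or below the given score."""
    at_or_below = sum(1 for s in all_scores if s <= score)
    return 100.0 * at_or_below / len(all_scores)


# Hypothetical raw scores for ten pupils (not taken from any real test).
raw_scores = [12, 15, 18, 20, 21, 22, 23, 25, 30, 36]

for raw in (20, 22, 23, 30):
    print(raw, percentile_rank(raw, raw_scores))
```

In this made-up distribution the scores from 20 to 23 sit close together, so one or two raw score points move the rank by ten or twenty points, which is the bunching effect described above.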

Stanines

Scores may be reported in stanines also. Stanines are groups
of values on a nine point scale of normalized values. A ranking
of one is the lowest, nine the highest, and five represents the
average performance for pupils in the norm group. They are tied
to the percentile ranks for a normal distribution and can be obtained
directly from the computation of percentile ranks. Since
the stanines are equal units the bunching effect of percentile
ranks is avoided. Differences in stanine ranks are comparable,
with a difference between six and eight being similar to a difference
between four and six. By using stanines, teachers can avoid
making distinctions which are too fine to be accommodated by the
test. The relationship between stanines and percentile ranks is
illustrated in the figure below:



Figure 5.1

Percent of cases     4    7   12   17   20   17   12    7    4

Stanine              1    2    3    4    5    6    7    8    9

Percentile              4   11   23   40   60   77   89   96



By observing the differences in stanine scores on subtests of 
batteries, the teacher can identify those areas in which a student 
is excelling or having difficulty. For most cases a difference 
of two or more stanines in a score between tests is said to be
cause to examine the performance in the exceptional area. One
problem with the use of such measures as percentile and stanines 
is that change over time is not usually identifiable. 
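
The stanine boundaries in Figure 5.1 and the two-stanine rule of thumb can be sketched together. The sketch assumes that a percentile rank falling exactly on a boundary belongs to the lower stanine; the function names are hypothetical, and the percentile ranks are illustrative values for one pupil.

```python
from bisect import bisect_left
from itertools import combinations

# Percentile ranks at the upper boundary of stanines 1 through 8 (Figure 5.1).
STANINE_UPPER_BOUNDS = [4, 11, 23, 40, 60, 77, 89, 96]


def stanine_from_percentile(percentile_rank):
    """Return the stanine (1-9) for a percentile rank.

    Assumption: a rank exactly on a boundary goes to the lower stanine.
    """
    return bisect_left(STANINE_UPPER_BOUNDS, percentile_rank) + 1


def exceptional_pairs(percentiles_by_subtest, minimum_gap=2):
    """List pairs of subtests whose stanines differ by at least minimum_gap."""
    stanines = {name: stanine_from_percentile(pr)
                for name, pr in percentiles_by_subtest.items()}
    return [(a, b) for a, b in combinations(stanines, 2)
            if abs(stanines[a] - stanines[b]) >= minimum_gap]


# Illustrative percentile ranks for one pupil.
pupil = {"Reading": 44, "Language": 66, "Science": 18, "Social Studies": 50}
print(exceptional_pairs(pupil))
```

Every flagged pair here involves Science, which suggests that subtest is the one to examine more closely.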

Grade Equivalent Scores

Grade equivalent scores will not be much help in showing
advancement over time either, unless longitudinal data is available.
Even then different students will perform differently,
with only those near the median showing an average growth of one
year for each school year of education. Grade equivalent scores
are determined by first translating all spring and fall raw score
medians for each level of a test into a common level. These raw
score points are plotted on a graph and a line is drawn that best
fits the points. It is possible that only a few points will be
identified and other points are assumed to fall on the line of
best fit. These points along the line are called grade norms.
The raw score corresponding to any grade equivalent indicates
the appropriate score that would be made by the pupils in the
standardization programs at a specific point in the grade. If
a test is given in a district that begins earlier or later than
the frame of reference for the test standardization, then teachers
should be notified and the necessary adjustments made.
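
The line-of-best-fit procedure described above can be sketched as follows. This is a simplified illustration, not a publisher's actual method: it assumes a straight line fitted by ordinary least squares, and the median points, function names, and the raw score of 26 are all hypothetical.

```python
def fit_line(points):
    """Least-squares line through (grade placement, median raw score) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def grade_equivalent(raw_score, slope, intercept):
    """Read the grade placement off the fitted line for a given raw score."""
    return round((raw_score - intercept) / slope, 1)


# Hypothetical fall (x.1) and spring (x.8) median raw scores.
median_points = [(3.1, 18), (3.8, 22), (4.1, 24), (4.8, 28), (5.1, 30)]

slope, intercept = fit_line(median_points)
print(grade_equivalent(26, slope, intercept))
```

Only five points were actually measured in this sketch; every other grade equivalent is read off the line, which is why values far from the tested grades deserve caution.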

Grade equivalents at best describe how a student at a particular
grade level would do if he took the test for the level
tested. For example, if a test is given to fourth graders in
October (designated by 4.1) and a seventh grader performing at
the average level for seventh graders took the test, the score
derived would be 7.1, indicating October in the seventh grade
year. If a fourth grader took the test and received a grade equivalent of
7.1, it would mean he achieved an identical score, not that he
would be able to perform math like seventh graders. Computation
and division and other content areas normally covered in the
higher grades would not have yet been taught to the fourth
grader. It would be inappropriate to place him out of grade
level. In fact, any score that is a grade equivalent more than
two years beyond the level should be referred to as well above
average and to attach more significance than that would be
misleading.

Scaled Scores

The score that may be the most useful in comparing students'
progress over time is the scaled score. For tests on all battery
levels, the scores may be comparable as in the case of the Metropolitan
or the Stanford Achievement Tests. The publisher
should identify this property of the derived scores if available.
Once raw scores are converted to scaled scores, the battery
level and the form can be ignored in further interpretations.
Batteries of the test are equated and forms are made equivalent
in going from raw score to scaled score. Features of the scaled
score allow comparison of achievement over time and for this
reason might be more useful to a teacher who is trying to generalize
about the student's developed abilities than other kinds
of scores.

In order to compare one test to another, for example a 
math test to a reading test, the scale chart provided by the 
publisher must be consulted. There is no direct comparison 
without such consultation. In addition, the scores have no 
meaning by themselves and cannot be used in interpretive de- 
cisions. Percentile ranks and grade equivalent scores offer more 
in this activity. 



Raw Scores

o Convert raw scores to derived scores such as percentile ranks
and stanines for later use in comparisons.

o Use in statistical analyses when computing correlation
coefficients and similar procedures.

Percentile Ranks

o Use to compare pupil's standing on a test or ranking in relationship
to a national or other group standard.

o Use to compare results among test batteries.

o May be a choice for reporting test results to parents,
pupils, and others who are not familiar with testing and
measurement.

Stanines

o Same as percentile ranks plus they may be used for making
comparisons with some other variable in performance such
as general learning ability.

Grade Equivalents

o Use for interpreting performance of groups such as an entire
class or grade.

o Use for measuring advancement over time when longitudinal
data is available and relative level of achievement is
accounted for in data.

o Use for determining relative individual achievement when
consideration is given to the differences that may be associated
with high, average, and low achieving characteristics
of the student.

Scaled Scores

o Use to study achievement over time as data is collected and
reviewed.

o Use for interpreting results when testing is out of level.

o Use for most applications when conducting statistical
analyses.

o Use to compare different forms and batteries of tests.



Using Test Scores 

Finally all of the education and preparation will be completed 
and the scores made available to the classroom teachers. Before 
beginning to record scores onto a summary sheet it is useful to 
see if the scores seem right. If based on the knowledge of the 
teacher about the student's past performance the scores are gen- 
erally what was expected, then further interpretation can be 
pursued. Stanine scores for students are useful for such initial
examination because of the ease with which assumptions about
"above average, average, and below average" categories can be
made. If a student is scoring in the second stanine on math
portions when you know the student to be generally superior
in math, then the test performance should be questioned. The
fault may lie with the content, presentation format, or other
variables. The physical condition of the students, the administration
of the test, or confusion on the marking of answer sheets
might account for the marked difference observed. If many students
score lower than anticipated, perhaps it is due to a lack of coordination
between what was taught and what was tested. This should
be checked against the validity information developed by the 
district test selection committee. The match between curriculum 
and content should have been identified and adjustments suggested. 
Test publishers will often identify the content of the test items 
in the technical manuals. These sources should be consulted to 
determine the extent of a match in circumstances that cannot be 
attributed to individual student performance. 

Sorting 

The apparent sorting of students into categories and smoothing
of individual differences is a function of test interpretations
of this type. Critics of testing will decry such sorting and
labeling as damaging to students. A few statements in support
of sorting are appropriate and offered for your consideration.
Sorting is neither good nor bad without reference.



o While teachers may be aware of the differences between 
twenty eight or thirty students, in practice these dif- 
ferences are minimized and groups are formed either con- 
sciously or unconsciously as teaching occurs. 

o The time constraints on teaching a course, limitations of 
books and materials available, and the physical character- 
istics of the classroom all lend themselves to sorting. 
To some extent efficiency of effort requires such sorting. 

o Sorting is appropriate when making some decisions about 
presentation of information and can lead to efficient 
teaching methods which allow teachers to be more flexible 
and individualize instruction with the remaining time and
resources. 






o Not everyone's score on a test is an indication of performance
ability. Terms developed in the sorting should
be applied to the context in which they were developed.
Knowing you were below or above average is meaningless
in most applications unless a relationship to a standard
of either a group or the individual is expressed concurrently.

o The performance level in relationship to a group standard
or individual standard can help to identify an appropriate
activity for the student in which the greatest opportunity
for educational achievement can occur.

Preliminary Activities 

If a review of the scores shows that most students are at the 
high end of the scale it is likely that another form of the test 
should have been used. If everyone scores in the upper ranges, the 
differences are masked and the test was probably too easy to be of 
use in determining strengths and weaknesses. Out of level testing 
might be pursued to gain more definitive information. 

References to the level of ability of the school or class from
previous testing or other measures of student ability may help to
identify the student who is achieving out of sync with previously
demonstrated abilities. Not only should the lowest achieving or
the highest achieving students be observed in light of school performance
or previous history, but all students should get consideration.
The students in the middle stanine ranges may well
be some of the ones that could have been expected to achieve in
the upper ranges. Too often these bright students are missed
when their performance is substandard for their ability. Students
who score well above what you expected may be exhibiting a perfectly
natural behavior in the structured testing situation and
have an entirely different set of behaviors in the regular classroom
setting. The initial review can help establish a level of
confidence in the test results, so that further specific interpretations
of the scores can be made.

A relationship between reading ability and test performance
can be expected. Reading ability is required for many of the subtests
and a student who scores low on reading and high on math
computation may be delivering a message about needing help in
reading. High reading and math application scores (thought problems)
followed by low scores on math computation subtests may
indicate a need for math improvement in the computation area. If
math questions that rely on reading ability can be handled successfully
then the problem lies more likely in the student's
math computation skills.






Student Score Summary

The individual pupil record form provides teachers with the
information from which most other observations will be made.
Figure 5.2 is a sample form that might be presented in reporting
student scores.



Figure 5.2 Student Score Summary

Name  Larry Rudner                    Grade
Teacher  Anne Goldblatt               Date of testing  10/17/80
School  Ward Elementary
District  Unionville                  State  Mass.

Test                                  Number Possible
Reading                                55
Math                                   45
Language                               55
Science                                40
Social Studies                         40
Basic Battery (R & M & L)             155
Complete Battery (Basic & S & SS)     235

Remaining columns of the form: Number Right, Scaled Score, Grade
Equiv., Percentile Rank, Stanine (handwritten entries not legible
in this copy).



The raw scores showing the number correct on each test are
entered in the appropriate space, here labeled "number right."
If hand scoring is done by the teacher, a key is usually provided
which allows a quick visual check to determine if the correct
answer is marked. These are totalled for each test and generally
identified as the raw score. While marking the booklet or score
sheet it would be useful to identify which choice was made by
the student in each wrong answer. This information used later
can assist in determining something about the choice in relationship
to behavior or abilities. For example, a student who consistently
marks the answer on a math test which has the decimal
misplaced may need review or emphasis on decimals. To know only
what has been seen as correct offers a limited view of the student's
performance.

Once the raw score is entered and the totals of the basic
and the complete batteries tabulated, the raw score can be converted
to a derived score. The derived score may be calculated
as demonstrated in earlier chapters of this manual, but the most
practical way to proceed is to use the tables generally found in
the instructor's manual accompanying the test.

When using the tables in the manual, be certain that you
have the correct form of the test and the correct norm for the
time of year in which the test was administered. Also be conscious
of the fact that many of the derived scores are based on mean or
median performance and are not in and of themselves indicative of
desired achievement levels.

In the example given in Figure 5.2, three derived scores
will provide some insight into the performance of the student as
shown on this test. From the percentile rank column, notice that
the student was in the 44th percentile on Reading and the 50th
in Social Studies. These two scores are fairly close to the center
of the distribution of scores and will be considered later
to determine if they represent significant variations in performance
or are a function of the "bunching" of scores that sometimes
occurs when using percentile rank. The Science score at
18 is in the lower third of the scores and the 66 in Language is
at the upper third dividing point. The assumption is that the
student is better in language skills than in math skills, but
more importantly you can see the highest and lowest score ranges.
Notice also that the difference between the highest score and the
others is similar to the difference between the lowest score and
the others near the center. Both Math and Language scores should
be examined more closely to see what relationships might exist.
By looking to group performance information and other student
records, some estimate of how close to expectations the student
performed can be determined. In the case of the low science
score it might be to the teacher's advantage to note the score
and later analyze the specific items missed. If specific concepts
were taught in the curriculum but the requirements for reading
and vocabulary on the test were different, then in order to do
reasonably well on the test some changes in the curriculum content
might be made. An alternative would be to find another
test for science or a battery of tests that has content closer
to the curriculum. The administration of a locally developed
objective or criterion referenced test might be a useful beginning
point.

Percentile Midpoint Bunching

A review of the scores near the midpoint of percentile rankings
can be of value to determine if a wide number of scores are
separated by only an answer or two near this performance level.
From the tables provided by the test publisher, the following
can be observed. The Social Studies score of 27 produced a percentile
ranking of 50 for this particular test. The other scores
produced percentiles as shown.



Figure 5.3
Relationship of Percentile Rank to Raw Scores

Social Studies

Raw Score            Scaled    Percentile    Difference from
(N Correct of 40)    Score     Rank          Student's Actual Score

24                   474       34            Score -3
25                   488       40            Score -2
26                   502       44            Score -1
27                   516       50            Student Score
28                   530       54            Score +1
29                   544       64            Score +2
30                   558       70            Score +3

By marking one answer differently, the percentile ranking
would have changed upward to the 54th percentile or downward to
the 44th percentile. A change of two answers would produce a
rank change from 10 points lower up to 14 points higher. Three
answers different would have moved the student from the middle
third to the upper or lower third of the rankings.
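
The swings described above can be checked directly against the values in Figure 5.3. The short sketch below uses only the raw score and percentile rank columns from that figure; the function name is a hypothetical one chosen for illustration.

```python
# Raw score -> percentile rank for the Social Studies test (Figure 5.3).
PERCENTILE_BY_RAW = {24: 34, 25: 40, 26: 44, 27: 50, 28: 54, 29: 64, 30: 70}


def percentile_swing(raw_score, answers_changed, table=PERCENTILE_BY_RAW):
    """Change in percentile rank if that many answers were marked differently.

    Returns (change with fewer correct, change with more correct).
    """
    center = table[raw_score]
    lower = table[raw_score - answers_changed]
    upper = table[raw_score + answers_changed]
    return lower - center, upper - center


for changed in (1, 2, 3):
    print(changed, percentile_swing(27, changed))
```

For the student's raw score of 27, two answers either way span 24 percentile points, which is why fine distinctions near the midpoint should be avoided.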






The effect of a one to three point raw score difference is
less drastic for the total battery. In this test the percentile
ranking of 40 for the whole battery is affected only two points
upward for three additional correct answers and four points downward
for three additional incorrect answers. At the midpoint
in the scale three answers either way produce a change of plus
or minus four points. Such properties of percentile rankings
should be kept in mind as the scores are reviewed. Decisions
made based upon percentile rankings should account for the swings
produced by a few answers at various points in the scale. Generalizations
rather than tightly drawn conclusions are better
uses of the scores in this instance.

Cluster Analysis 

The use of cluster analysis may help determine the specific
strength and weakness of a student within a subtest. Often the
publisher will provide information about the national median performance
on the items in a cluster of related items on the test.
Figure 5.4 represents the mathematics cluster on which a student
scored 26 correct answers out of 45 possible. For this exercise
the number in the lower right hand corner of each box will
represent a norm referencing to the national median performance.
A discussion of criterion referencing will follow. The number in
the upper left hand corner is for the student's score on the items
in that cluster.



Figure 5.4
Cluster Analysis for the Math Sub-test

Mathematics

Item Cluster                   No. Possible

Numeration                     10
Geometry & Measurement         10
Problem Solving                10
Operations: Whole Numbers      15

(The student's score and the national median entered in each box
are not legible in this copy.)


Referring to the manual, a table appropriate for the grade level
and time of year for test administration showing how the student
did in relationship to the average for the norm group can be
made. Also some indication of how difficult a cluster of items
was for most students can be seen. If the average number correct
was four of ten it would likely be a more difficult item
cluster than one in which students answered eight of ten correctly.
In the example in Figure 5.4, the student's strengths
and weaknesses shown in the cluster indicate that the student
is not evenly skilled in the various operations of mathematics.
In both Numeration and Geometry and Measurement the student was
above the national average. Problem Solving was particularly
difficult for the student. The score was just over half of
the average score for all students taking the test. Assumptions
on general skills within the subtest can be made by the teacher
providing that the items in the cluster match closely the items
covered in the curriculum. By keeping track of the performance
of various students in clusters some indication is shown as to whether
individual attention must be given the student or perhaps
other changes in the curriculum or teaching methods would
be in order.

Criterion referencing can be applied to the same kind of
cluster analysis by substituting the teacher's criteria or some
other criteria for the norm referenced scores. If a teacher
felt it was important for a student to pass all items in the
Numeration cluster because it was an essential skill for later
instruction, then the criteria might be equal to the number possible,
in this case ten correct out of ten. If the teacher felt
that the minimum acceptable performance level was 80 percent
correct, then a score of 8 in the problem solving cluster shown
in Figure 5.4 would be the goal.
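
The criterion arithmetic above can be sketched as a small helper. The cluster sizes come from Figure 5.4; the function names and the pupil's cluster scores are hypothetical, and rounding the cutoff up to a whole item is an assumption made for this sketch.

```python
# Number of items in each mathematics cluster (Figure 5.4).
ITEMS_POSSIBLE = {
    "Numeration": 10,
    "Geometry & Measurement": 10,
    "Problem Solving": 10,
    "Operations: Whole Numbers": 15,
}


def criterion_cutoff(number_possible, percent_required):
    """Smallest number right that meets the required percent (rounded up)."""
    return (number_possible * percent_required + 99) // 100


def meets_criterion(number_right, number_possible, percent_required):
    return number_right >= criterion_cutoff(number_possible, percent_required)


# Hypothetical cluster scores for one pupil, checked against an 80 percent criterion.
pupil_right = {"Numeration": 9, "Geometry & Measurement": 8,
               "Problem Solving": 4, "Operations: Whole Numbers": 11}

for cluster, right in pupil_right.items():
    possible = ITEMS_POSSIBLE[cluster]
    print(cluster, right, criterion_cutoff(possible, 80),
          meets_criterion(right, possible, 80))
```

Setting the required percent to 100 reproduces the ten-out-of-ten criterion for the Numeration cluster.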

Certain kinds of tests provide results which are more
appropriate for determining detailed information about competence
on specific learning objectives. The Reading, Mathematics, and
Language tests would be usually expected to fall into this category.
By studying the responses of students to the questions
some indication of strength and weakness for the pupil or class
can be determined. More importantly, a study of this kind can
indicate where more diagnostic assistance is needed or where
some judgments will be required about how to proceed with the
student. To assist in the time consuming task the option of
computer assisted scoring would be useful provided such a service
is available. Certainly one would not select a test that was
invalid or unreliable because of the scoring feature. If tests
are comparable with the exception of the scoring service, then
those who would be inclined to use item analysis should look
seriously at a test with this feature.

Item Analysis 

Item analysis involves identifying individual student
responses on a test and determining how many correct responses
were given as well as what alternatives were chosen. A chart
could be constructed with the objective of the test to be
examined listed with other item identifying information. Additional
information about the percent of students getting the
answer correct could be shown for the class, the school, the
district, and the nation depending upon the universe the teacher
is using for comparison purposes. The responses for the choices,
including the choice to omit an answer, can be summarized for
class performance analysis as well as for noting individual pupils'
responses.

Such a process will take some time to accumulate or to consider
fully even if computer scoring is used. Hand scored tests
can be summarized in the same way but it is a lot of work, especially
if all students and all responses are transcribed onto
a worksheet. The teacher may be interested in a sample of questions
and could easily eliminate those which do not appear to be
of use in examining the course objectives. A percent correct
column could be used to determine how difficult an objective
was for all students. Test publishers will often assign the
percent correct identification to test questions to show that a
specific number of students were successful in selecting the
correct answer. If the percent correct is high it can be assumed
that the question was of a low difficulty level and likely
was designed to identify those students at the lower end of the
scoring scale who need assistance in this skill area. The questions
with a low percent correct number will probably help the
teacher differentiate between the top scoring students and assist
with identifying those who need additional out-of-level testing
to more precisely identify strengths and weaknesses.
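
The worksheet described above, with a percent correct column and a tally of the choices made on each item (including omits), can be sketched as follows. The answer key, pupil labels, responses, and function name are all hypothetical and stand in for whatever a local scoring form records.

```python
from collections import Counter


def item_analysis(responses, answer_key):
    """Summarize each item: percent correct and how often each choice was made.

    responses maps a pupil to a list of choices; None marks an omitted item.
    """
    summary = []
    for index, correct in enumerate(answer_key):
        tally = Counter(
            choices[index] if choices[index] is not None else "omit"
            for choices in responses.values()
        )
        summary.append({
            "item": index + 1,
            "percent_correct": 100.0 * tally[correct] / len(responses),
            "choices": dict(tally),
        })
    return summary


# Hypothetical three-item test taken by four pupils.
answer_key = ["B", "D", "A"]
responses = {
    "Pupil 1": ["B", "D", "C"],
    "Pupil 2": ["B", "A", None],
    "Pupil 3": ["C", "D", "C"],
    "Pupil 4": ["B", "D", "A"],
}

report = item_analysis(responses, answer_key)
for row in report:
    print(row)
```

In this made-up data the third item has a low percent correct and two pupils chose the same wrong alternative, which is the kind of pattern worth a closer look.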

Consistent incorrect answers within clusters of objectives
may flag students who will need additional help in an area. If
the entire class scores below the average it may mean that this
objective needs to be looked at in terms of the local program.
While the test may not meet the specific criteria for the class
in the way it covers content, it certainly should be some indication
of the need for the teacher to look further as to the
cause for the wide differences. Generally it would be reasonable
to identify those clusters of items on which large
groups of students have scored ten percent lower than average, or
fifteen percent lower for groups of less than fifty students. The
test manual would likely indicate the level of significance for
these various score variations and the teacher should be guided
by their specific instructions.
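
The rule of thumb above can be sketched as a simple screen. This reads "ten percent" and "fifteen percent" as differences in percent correct between the group and the norm average, which is an assumption; the function name and all figures are hypothetical, and the test manual's own significance levels should take precedence.

```python
def clusters_to_review(group_percent_correct, norm_percent_correct, group_size):
    """Flag clusters where the group falls below the norm average by the threshold.

    Assumed thresholds: 10 points for groups of fifty or more pupils,
    15 points for smaller groups.
    """
    threshold = 10 if group_size >= 50 else 15
    return [
        cluster
        for cluster, percent in group_percent_correct.items()
        if norm_percent_correct[cluster] - percent >= threshold
    ]


# Hypothetical percent correct by cluster for a class and for the norm group.
class_percent = {"Numeration": 70, "Problem Solving": 48, "Geometry": 60}
norm_percent = {"Numeration": 72, "Problem Solving": 60, "Geometry": 75}

print(clusters_to_review(class_percent, norm_percent, group_size=28))
print(clusters_to_review(class_percent, norm_percent, group_size=120))
```

The same figures flag more clusters for a large group than for a single class, since small groups vary more by chance.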

Group Data

The use of group data can assist in making general statements
about class performance. Group data can be summarized on class
record sheets in which the scores for an entire class are presented.
The class record form can be developed locally. If so
it should include the information useful to the district in its
long term longitudinal studies. Generally test publishers may
provide a suggested form or may be able to provide one with its
scoring service. In any event the concern of the user should
outweigh the convenience for the publisher.






covering all tests in print and all out-of-print tests once
listed in MMY, a name index to authors of over 70,000 documents
(tests, reviews, excerpts, and references) in the
seven MMYs and TIP II, and a scanning index for quickly
locating tests designed for a particular population.

ERIC Clearinghouse on Tests, Measurement and Evaluation, Educational
Testing Service, Princeton, New Jersey, 08540
Test information and bibliographies are available through the
ERIC Clearinghouse and documents can be purchased through the
ERIC Document Reproduction Service (EDRS), Computer Microfilm
International Corporation, P. O. Box 190, Arlington, Virginia,
22210.

Gronlund, Norman E. Measurement and Evaluation in Teaching. New
York: The Macmillan Company. 1965.

A good basic testing text for classroom teachers that includes
simple, concise and straightforward discussions of most of
the major issues covered in this manual.

Mehrens, William A. and Irvin J. Lehmann. Standardized Tests in
Education, Third edition. New York: Holt, Rinehart and
Winston. 1978.

Includes chapters on reliability, validity, and reviews of
some of the more commonly used standardized tests.

Northwest Regional Educational Laboratory. Guidelines for Selecting
Basic Skills and Life Skills Tests. Portland, Oregon:
Clearinghouse for Applied Performance Testing, Northwest
Regional Educational Laboratory. 1980.

Given the important role tests play in education, it is crucial
that test users understand the fundamental principles of proper
test use. These guidelines present some of those principles,
focusing specifically on the selection and purchase of published
basic academic skill and life skills tests. However, though
the guidelines focus specifically on tests of basic and life
skills, the principles presented here can be applied to
review, selection, and purchase of most achievement tests
intended for use in educational settings. Aptitude tests
(those intended to measure a student's capacity to learn) are
not covered here.

To supplement the guidelines and further assist educators'
test review and selection, the appendices contain extensive
lists of currently available basic skills tests. Information
is presented on test characteristics, publishers, and sources
of additional, more detailed information. Although these
lists are intended to be quite comprehensive, inclusiveness
is not claimed. Readers are urged to consult the reference
documents cited in the appendix for more comprehensive
listings.




MULTISUBJECT
ACHIEVEMENT BATTERIES



emeus. Lc«tbC&0 



Phonetic Analysis 

OnlRcMling 

S^aadTcU 

Do Ym Know 

TIiinkbThrottgh' 

TU^pIUkc 

tAtrntmul Environment Questionnaire 
Comprehensive Tests of Basic Skills Expanded
Forms S&T (CTBS)



Matkematics 

LMpMifeAiis 

Reference Skills



Social Studies
Criterion Test of Basic Skills



Atiikmetic 



ittery 



(Arte 

Iowa Tests of Basic Skills
Multilevel Edition Forms 7&8
Reading Comprehension



Skills 
rSkilb 
WoA^tHdy Skills 
VoeabMlary 

Iowa Tests of Educational Development: SRA



^Arte 
SodilStodies 



Metropolitan Achievement Tests (METRO 78)
Reading Comprehension



Grade
Level(s)

1-3



Publication

Date Publisher Reference



1979 



AW 



Kindeisarten- 1976 
12 



CTBS 



Kindergarten- 1976 
8 



1-8 



3-9 



1976 



1978 



ATP 

STS 

HM 



9-12 



1974 



SRA 



Kindergarten- 1978 
12 



NOT Sept 79 



MMY 12 



MMY14 



TCBJan77 
pg.5 



NOT July 79 



MMY 20 



Ply. Corp. Fall 78 
NCME 






MULTISUBJECT 
ACHIEVEMENT BATTERIES 



Tests and Subtests

National Educational Development Tests
Mathematics Usage
English Usage
Social Studies Reading
Natural Sciences Reading
Word Usage

Primary Survey Tests
Reading
Mathematics



Grade      Publication

Level(s)   Date   Publisher   Reference



7-10 



1974 



SRA 



MMY23 



Language 
Spelling



Scholastic Testing Service
Educational Development Series
Scholastic Tests

Mathematics
■M^insn 
Social Studies
Science

Solving Everyday Problems
USA in the World
Nonverbal Ability
Verbal Ability
School Interests
School Plans
Career Plans

Science Research Associates
Achievement Series (ACH) Forms 1&2

Mathematics
Language Arts
Social Studies



2-3 



2-12 



1973 



1976 



SF 



STS 



TIP 27 



MMY20 



Kindergarten- 1978 SRA NCME 

12 Fall 78 



Reference Materials
Applied Skills

Science Research Associates
High School Placement
Reading

Arithmetic or Modern Math
Language Arts
Social Studies



1973 



SRA 



TIP31 



Science Research Associates
Norm Referenced/Criterion Referenced Testing



4-10 



1977 



SRA 



TCBJuly77 






Stanford Diagnostic Reading Test (SDRT)           1-13    1976    Psy. Corp.    MMY 777
Wisconsin Design for Reading Skill Development    K-6     1972    NCS           MMY 778
Woodcock Reading Mastery Tests (WRMT)             K-12    1973    AGS           MMY 779



MATHEMATICS TESTS 



Test 

Analysis of Skills: Mathematics (ASK:
Mathematics)

Assessment of Skills in Computation (ASC)
Basic Arithmetic Skill Evaluation
Diagnosis: An Instructional Aid: Mathematics
Diagnostic Mathematics Inventory (DMI)
OicvisionofthePMD 

Diagnostic Screening Test: Math (DSTM)
ERB Modern Arithmetic Test
Fountain Valley Teacher Support System in
Mathematics (FVTSS-M)

Individual Pupil Monitoring System-Mathematics

KeyMath Diagnostic Arithmetic Test

Mastery: An Evaluation Tool: Mathematics

Mathematics: IOX Objectives-Based Tests

Minimum Essentials for Modern Math

Objectives-Referenced Bank of Items and Tests:
Mathematics (ORBIT:M)

Stanford Diagnostic Mathematics Test (SDMT)
Steenburgen Quick Math Screening Test

Tests of Achievement in Basic Skills: Mathematics
(TABS:M)



GnMk 


Publicatioa 








Date 


nuMisner 


Reference 


1JI 


197o 


STS 


MMY25I 


7-9 


1978 


era 


NOT Oct 79 


1-9 


1974 


nLC 


MM Y 303 


1-6 


1974 


SKA 


MMY 263 


1-8 


1975 


era 


MMY 264 


1.11 
1-11 


1979 


sc 


NOT Nov 79 


54> 


1971 


ERB 


TIP 718 


K-8 


• 1974 


Zweig 


MMY 270 


1-8 


1977 


EDC 


MMY 275-6 


1-8 


1973 


RS 


MMY 274 


K-6 


1976 


AGS 


MMY 305 


K.9 


1976 


SRA 


MMY 278 


K-9 


1976 


lOX 


MMY 279 


64 


1971 


Hayes 


TIP638 


K-Aduh 


1975 




MMY 287 


l->duh 


1976 


Ihy.Corp. 


MMY 292 


1-6 


1978 


ATP 


NOT Feb 79 


K-12 


1976 


EdITS 


MMY 293 



LIFE SKILLS TESTS

Test                                              Grade Level(s)  Publication Date  Publisher  Reference

Adult Performance Level Functional Literacy Test  9-Adult         1978              ACT        FLIT Pg. 42
Assessment of Skills in Computation (ASC)         7-9             1978              CTB        NOT Oct 79
Everyday Skills Tests (EDST)                      6-12            1975              CTB        MMY 18
IOX Basic Skills Test                             9-12            1978              IOX        NCME Spring 1979
19*4 Consumer Mathematics Test                    9-12            1973              NMS        MMY 312
Reading/Everyday Activities in Life (R/EAL)       9-Adult         1972              CAL-P      FLIT Pg. [illegible]
SRA Coping Skills: A Survey plus Activities       7-8 & Adult     1979              SRA        NCME Special Edition 1979






SRA Survival Skills

STS Educational Development Series: Scholastic
Tests

Senior High Assessment of Reading Performance
Forms A, B, C (SHARP)



Stories about Real-Life Problems
Test of Consumer Competencies

Test of Everyday Writing Skills (TEWS)



Tests of Performance in Computational Skills
(TOPICS)

Wisconsin Test of Adult Basic Education



6-Adult 


1976 


SRA 


TCBJul77 


2-12 


1976 


STS 


MMY27 


in 19 






1 •fill i 9 








Pg . II 








(Form A> 








NCME 








Winter 77 








(Form B) 


5-8 


— 


NIU 


NOT May 79 


8-12 


1976 


STS 


TCBJan 








Pg.6 


9-12 


1978 


CTB 


NCME 








Spring 78 


9-12 


1978 


CTB 


NCME 








Winter 77 


Aduk 




RFD 


FLIT Pg. 48 



LANGUAGE ARTS TESTS 



Test                                                           Grade Level(s)  Publication Date  Publisher  Reference

Analysis of Skills: Language Arts (ASK: Language Arts)         2-[illegible]   1976              STS        MMY 41
Diagnostic Screening Test: Language                            K-Adult         1977              SC         NOT Oct 79
Language Arts: IOX Objectives-Based Tests                      K-6             1974              IOX        MMY 53
Language Arts: Minnesota High School Achievement Examinations  7-12            1970              AGS        TIP 90
Writing Test: McGraw-Hill Basic Skills System                  11-12, Adults   1970              MHBC       TIP 125






Publishers' Names, Addresses
and Telephone Numbers

This Appendix lists the test publishers identified
in Appendix A.

ACT The American College Testing Program
P.O. Box 168
Iowa City, Iowa 52240
(319) 356-3711

AGS American Guidance Service, Inc.
Publisher's Bldg.
Circle Pines, MN 55014
(612)786^343

ATP Academic Therapy Publications
28 Commercial Blvd.
Novato, CA 94947
(415)883^14

AW Addison-Wesley Publishing Co., Inc.
Jacob Way
Reading, MA 01867
(617) 944-3700

BFA BFA Educational Media
2211 Michigan Avenue
P.O. Box 1795
Santa Monica, CA 90406
(213) 829-2901

BMC Bobbs-Merrill Co., Inc.
4300 West 62nd Street
Indianapolis, IN 46268
(317) 298-5400

CAI Curriculum Associates, Inc.
5 Esquire Rd.
N. Billerica, MA 01862
(617) 935-8410

CAL-P CAL Press, Inc.

76 Madison Ave.
New York, NY 10016
(212) 685-0892

CARE The Center for Applied Research in
Education

West Nyack, NY 10994
(914) 358-8991

CraA CroA Incorporated
4922 Harford Road
Baltimore, MD 21214
(301) 254-5082



CSDE California State Dept. of Education
721 Capitol Mall
Sacramento, CA 95814
(916) 445-4688

CTB CTB/McGraw-Hill

Del Monte Research Park
Monterey, CA 93940
(408) 649-8400

EDC Educational Development Corporation
P.O. Box 45663
Tulsa, OK 74145
(918) 622-4522

EdITS EdITS/Educational and Industrial
Testing Service
P.O. Box 7234
San Diego, CA 92107
(714) 222-1666

ERB Educational Records Bureau
Educational Testing Service
Box 619

Princeton, NJ 08540
(609) 921-9000

EPS Educators Publishing Service
75 Moulton St.
Cambridge, MA 02138
(617) 547-6706

HAYES Hayes Educational Laboratory
7040 North Portsmouth Ave.
Portland, OR 97203
(503) 285-3745

IILC Imperial International Learning Corp.
Box 548

Kankakee, IL 60901
(815) 933-7735

IOX Instructional Objectives Exchange
Box 24095

Los Angeles, CA 90024
(213) 474-4531

Jastak Jastak Associates, Inc.
1526 Gilpin Ave.
Wilmington, DE 19806
(302) 652-4990

McGrath McGrath Publishing Co.
P.O. Box 9001
Wilmington, NC 28402
(919) 763-3757

Merrill Charles E. Merrill Publishing Co.
1300 Alum Creek Drive
Columbus, OH 43216
(614) 258-8441

MHBC McGraw-Hill Book Co.

1221 Ave. of the Americas
New York, NY 10020
(212) 997-1221



NCS NCS Interpretive Scoring Systems
4401 West 76th St.
Minneapolis, MN 55435
(800) 328-6290

NIU Northern Illinois University
Alan M. Voelker
Curriculum and Instruction
DeKalb, IL 60115
(815) 753-1000

NMS New Mexico State Dept. of Education
Monitor
Education Bldg.
State Capitol
Santa Fe, NM 87501
(505) 827-2429

Psy. Corp. The Psychological Corporation
304 E. 45th Street
New York, NY 10017
(212) 888-3500

RFD Rural Family Development Program
University Extension
University of Wisconsin
P.O. Box 1379
Madison, WI 53701
(608) 262-1234

RS The Riverside Publishing Company
1919 South Highland Avenue
Lombard, IL 60148
(312) 629-9700

SC Stoelting Co.
1350 S. Kostner
Chicago, IL 60623
(312) 522-4500

SMIC Southwest Regional Resource Center
127 South Franklin
Juneau, AK 99801
(907) 586-4806

SF Scott, Foresman & Co.
1900 East Lake Ave.
Glenview, IL 60025
(312) 729-3000

SOI SOI Institute
214 Main St.
El Segundo, CA 90245
(213) 322-5995

SRA Science Research Associates, Inc.
155 N. Wacker Dr.
Chicago, IL 60108
(800)ttl4884



TCP Teachers College Press
Teachers College
525 West 120th St.
New York, NY 10027
(212) 678-3929

USDL United States Dept. of Labor
Bureau of Labor Statistics
1515 Broadway
New York, NY 10036

(212) 399-5405

Winch B. L. Winch and Associates
45 Hitching Post Dr.
Rolling Hills Estates, CA 90274

(213) 547-1240

Zweig Richard L. Zweig Associates, Inc.
20800 Beach Blvd.
Huntington Beach, CA 92648
(714) 536-8877



STS Scholastic Testing Service, Inc.
480 Meyer Road
Bensenville, IL 60106
<S»>788-7I98



Reference Materials Describing
and Reviewing the Tests

Mental Measurements Yearbook (MMY)

Of the sources cited, The Eighth Mental Measurements
Yearbook provided the most comprehensive information
on tests. To best utilize the source, read the
introductory section, "How to Use This Yearbook."
Tests are indexed in the yearbook by test number.
That number is provided in the previous test lists as
the following reference number. However, the number
of any test can also be located via the yearbook
title or subject indices. Information provided about
each test includes:

Title
Description of the groups for which the test is intended
Date of copyright or publication
Acronym
Individual or group test
Forms, parts, and levels
Machine-scorable answer sheets
Costs
Scoring and reporting services
Author
Publisher
Foreign adaptation
Cross references

Additional references to published articles, books
and unpublished theses on the construction, validity,
use and limitations of each test are reported as part of
each test entry. Original reviews of each test by
independent measurement experts are provided.

Tests in Print (TIP)

The companion volume to the Yearbook is Tests in
Print II. It provides the reader with similar but much
less detailed information on tests. Again, a section
entitled "How to Use This Book" is provided. Tests in
Print II presents a bibliography of all known tests
published for English-speaking subjects and an index
to all tests published in previous editions of the
Mental Measurements Yearbook. TIP II provides the
following information:

Title
Test population
Copyright date
Acronym
Special comments
Part scores
Author
Publisher
Foreign adaptations
Cross references within TIP II
Sublistings

NCME Measurement News (NCME)

The "NCME Measurement News," the official
newsletter of the National Council on Measurement
in Education, provides a brief description of recently
published tests. Information includes publisher,
copyright, subject matter, levels, grade, interpreting
manuals and costs. It is suggested that individuals
desiring additional information contact the publisher
directly using the addresses provided. Tests
appearing in the newsletter do not represent
endorsement by the NCME or its staff.

News on Tests (NOT)

ETS "News on Tests" and its predecessor, the "Test
Collection Bulletin," provide descriptions similar to
the NCME newsletter. The test title, author, publisher
and address, copyright and grade level are included,
with a brief statement of content and levels.
Information provided is descriptive rather than evaluative,
and Educational Testing Service "News on Tests" also
includes announcements of new publications relating
to testing, conferences, and available test
bibliographies.



Tests of Functional Literacy (FLIT)

The review of currently available Tests of Functional
Adult Literacy provides information on the
characteristics and quality of a series of standardized,
criterion referenced, and informal tests of literacy.
Included in the descriptive profiles of tests is
information on publisher, content and skill coverage,
availability of alternate forms, administration procedures,
materials needed, scoring procedures, interpretation
procedures, validity and reliability.

The reader is urged to take advantage of these
informational documents and to contact test publishers
for complete information on available tests.



Major U.S. Publishers of Standardized Tests
from the Test Collection, Educational Testing Service



The following publishers are listed in this collection but
were not listed in the collection supplied by the Northwest
Regional Educational Laboratory.



Bureau of Educational 
Measurements 

Kansas State Teachers College 
Emporia, KS 66801 
316-343-1200 

Bureau of Educational 

Research & Service 

C-20 East Hall 

The University of Iowa 

Iowa City, IA 52240

319-353-2823 

Committee on Diagnostic 
Reading Tests, Inc. 
Mountain Home, NC 28758
704-693-5223 

Consulting Psychologists Press 
577 College Avenue 
Palo Alto, CA 94306 
415-326-4448 

Educational Testing Service 
Princeton, NJ 08541 
609-921-9000 

Western Office: 
1947 Center Street 
Berkeley, CA 94704 
415-849-0950 

Follett Publishing Co. 

A Division of Follett Corp. 

Department DM 

1010 West Washington Blvd. 

Chicago, IL 60607 

312-666-5855 

Ginn and Company 
P.O. Box 2649
1250 Fairwood Avenue
Columbus, OH 43216 
^U-253««6ei 



Grune and Stratton, Inc. 
111 Fifth Avenue
New York, NY 10003 
212-741-6800 

Guidance Testing Associates 
of St. Mary's University
1 Camino Santa Maria 
San Antonio, TX 78284 
512-436-3304 

Institute for Personality and
Ability Testing (IPAT)
1602 Coronado Drive
Champaign, IL 61822
217-352-4739 

Martin M. Bruce, Publishers 

340 Oxford Road

New Rochelle, NY 10804 

914-235-4450 

Priority Innovations, Inc. 
P. 0. Box 792 
Skokie, IL 60076 
312-729-1434 

Psychological Research Services 
Case Western Reserve University 
1695 Magnolia Drive 
Cleveland, OH 44106 

216- 368-3536 

Psychological Test Specialists 
Box 1441 

Missoula, MT 59801 

Psychologists and Educators, Inc 
Suite 212 

211 West State Street 
Jacksonville, IL 62650 

217- 243-2135 






Psychometric Affiliates 
Box 3167 

Munster, IN 46321 
219-836-1661

Richardson, Bellows, Henry 

and Co., Inc.

1140 Connecticut Ave. 

Washington, DC 20036 

202-659-3755 

Sheridan Psychological 
Services, Inc. 
P. 0. Box 6101 
Orange, CA 92667 
714-639-2595 



University Bookstore 

Purdue University 

360 State Street 

West Lafayette, IN 47906 

Western Psychological Services
12031 Wilshire Boulevard
Los Angeles, CA 90025
213-478-2061






BY 






DIRECTOR OF RESEARCH
ASSISTANT DIRECTOR OF RESEARCH



AMERICAN FEDERATION OF TEACHERS, AFL-CIO

OCTOBER, 1980



EXERCISES

Problem #1

Using as much of the data about the district and tests as is
available, select the test that best suits your district's
needs. Your goals for testing should be determined, and the
only limitation is that you are under a mandate to offer general
evaluation data to the schools and public through the grades.
The evaluation instrument selected must be supported by whatever
facts you can draw from the information. Your process of
selection and other considerations should be noted.

Problem #2

You have selected a test and the scores are reported for
the district, schools, for a class, and for some students.
Groups one and two will evaluate the scores for the district
and prepare a statement to present to the district. They must
be prepared to answer any questions regarding their analysis
and should prepare a short presentation which will be given at
either (1) the district board of education meeting or (2) a
press conference called to announce the results of the testing
program.

Groups three and four will analyze the class data. Group
three should be prepared to discuss the use of the data in
making generalizations about the class performance, strengths
and weaknesses of pupils, and recommendations for individual
students.

Group four will evaluate the test scores in light of proposed
curriculum changes or additions that may be necessary.
They will meet with a parent council to discuss the necessary
changes in light of problems identified with subject matter
areas.

Group five and other groups will review the scores of students
listed on the school's score sheet. They will prepare
comments for parents of the children and present four of the
eight in the final review. If necessary they must defend the
scores and the test.






HARMON, U.S.A.



Population 1975
14285

Number of Students
ii480

Number of Schools
7



High School, Dewey
1515 students; 55
teachers; 19.5:1 PTR



Junior High, Kennedy
1105 students; 64
teachers; 18:1 PTR

Elementary Schools -
100 teachers;
iHii r - f?00
Vhid'Hx A 656
North - 700
Brett - 625
Rose - 650
PTR: 17:1




Harmon, U.S.A., is a town with a high school, a junior high, and five
elementary schools. The enrollment in recent years has been declining
slightly, but in most respects it is an average small district near
an urban center. It has a manufacturing plant, assorted medium and
small businesses offering services to the residents, and a state college;
it is served by major highways and railroads and boasts one airport,
Sky Harbor International Airport.

The schools have grown up around the city. The Junior High occupies
the former high school building. The airport was built adjacent to
the Duff Elementary School on land acquired by the district and later
sold to the city when enrollment declines offset the need for further
construction. As a consequence the development of housing has been
pressing to the east part of town and to the north. Elementary
enrollment has been increasing slightly, and the composition
of the student body has been changing in recent years.

The spendable family income of $10,360 is slightly below the national
average of $10,504. While the national population's education level
indicates that 52 percent graduate from high school, in Harmon 54
percent have completed high school. The ethnic breakdown for the
community is 71 percent white, 19 percent black, 6 percent Hispanic,
and 4 percent other. Recently a sizable group of refugees has
relocated in the city and their children have enrolled in public schools.




11 DUPONT CIRCLE, N.W., WASHINGTON, D.C. 20036



MEMORANDUM
To: Staff

From: Test Evaluation Committee
Re: Technical Data Comparison
Date: December 3, 1979



The subcommittee on test comparison met and reviewed a number
of tests over the last three months. We have determined that
one battery of tests will be best for our purposes as outlined
by the committee on evaluation in their memo last spring
(attached). In keeping with the direction of the committee,
however, to present three choices for final comparison, we
are reviewing in this memo the Standardized Achievement Series
Assessment Survey (SASAS), the Urban School Estimate (USE)
and the Criterion Ability Check of Potential (CRACPOT). Copies
of the various tests are available for your consideration.
In addition we have compared some major points on each of the
tests and summarized the information.

We have taken our information from the test manuals,
from Buros Mental Measurements Yearbooks, and from publications
in which the tests were reviewed. Where applicable, we have
included comments from those documents.
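
One of the points summarized in the subcommittee's comparison is the standard error of measurement (SEM) of each test. As a rough illustration of how that figure is used, the sketch below places a band of one SEM on either side of an observed score and checks whether two students' bands overlap. The SEM values 2.4, 2.9 and 3.1 are the ones the comparison reports for SASAS, USE and CRACPOT; the function names, the example scores, and the assumption that the SEM is expressed in the same units as the score are illustrative only.

```python
# SEM values as reported in the subcommittee's comparison of tests.
SEM = {"SASAS": 2.4, "USE": 2.9, "CRACPOT": 3.1}


def score_band(observed, sem, n_sem=1.0):
    """Return (low, high): the observed score plus or minus n_sem standard errors.

    Assumes the SEM is in the same units as the observed score.
    """
    return (round(observed - n_sem * sem, 1), round(observed + n_sem * sem, 1))


def bands_overlap(band_a, band_b):
    """True when two score bands share at least one point."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]


# Hypothetical scores for two students, used only to illustrate the idea.
first, second = 50, 56
for test, sem in SEM.items():
    a, b = score_band(first, sem), score_band(second, sem)
    print(test, a, b, "overlap" if bands_overlap(a, b) else "distinct")
```

With these made-up scores the two bands are distinct under the SASAS and USE values but overlap under the larger CRACPOT value, which is why a larger SEM calls for more caution before treating a small score difference as real.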



TEC/ag 




COMPARISON OF TESTS
HARMON SCHOOLS TEST EVALUATION SUBCOMMITTEE

Item                    SASAS            USE              CRACPOT

Reliability
  Test-Retest           .87              .85              Not available
  Equivalent forms      Not available    Not available    Not available
  Split Half (RII)      .91              Not available    .86
  KR 20                 .75              .67              .86
  KR 21                 Not available    Not available    Not available

Standard Error
of Measurement          2.4              2.9              3.1

Validity

  Content
    SASAS:    Acceptable match to district - 2nd closest of those
              reviewed
    USE:      Fails to match our language objectives, [illegible]
    CRACPOT:  Closest to our curriculum & processes

  Criterion Related

    Predictive *
      SASAS:    .85 correlation with future test scores in sample
                test of 100 fall administration
      USE:      .73 correlation with spring scores on fall
                administration to sample of 100
      CRACPOT:  .60 correlation with spring scores after fall
                administration to sample of 100

    Concurrent
      Predicted stanine ranges in 80% of the cases
      Predicted stanine ranges in 95% of the cases on Reading test
      CRACPOT:  Stanine ranges not available except by dividing
                percentile ranks

  Construct
    The committee reviewed the test manuals and other reviews and
    are satisfied that SASAS and USE will be compatible with our
    district philosophy. CRACPOT was too oriented toward single
    score progress identifications.

  Face
    SASAS:    Test appeared to be taken seriously in sample; was
              easy to read, mark, and acceptable
    USE:      Students were somewhat confused by presentation of
              math computations; oral response portions were subject
              to variations with administrators of tests, some
              portions acceptable but answer sheet difficult to follow
              for students resulting in some obvious mismarking of
              choices.
    CRACPOT:  Use of pictograms failed to give students confidence
              in test. Test booklet was printed in light green and
              purple, making it difficult to read. Not acceptable.

* Sample test was administered to 100 students in fall and test currently
[illegible] administered in spring of following year. The reading level test
was administered at the same time in the fall and stanine ranges were



COMPARISON OF TESTS
HARMON SCHOOLS TEST EVALUATION SUBCOMMITTEE

Norm Data

  Student sample size
    SASAS:    275,000
    USE:      200,000
    CRACPOT:  1,950

  Standardization month, year
    SASAS:    April, October 1977
    USE:      May, September 1975
    CRACPOT:  January 1979

  States or Regions
    SASAS:    Proportionate allocation as to U.S. Census Report
    USE:      12 major urban cities in U.S.
    CRACPOT:  50% North Central U.S., 40% Western U.S., 10% Eastern U.S.

  Community Size
    SASAS:    35% Urban, 50% Small Town or Suburban, 15% Rural
    USE:      100% Cities 500,000 or more
    CRACPOT:  60% Small town, 20% Rural, 20% Urban

  Sex
    SASAS:    48.6% Male, 51.4% Female
    USE:      48% Male, 52% Female
    CRACPOT:  54% Male, 46% Female

  Age
    SASAS:    5% each ages 5-7; 16-17; 75% ages 8-15 evenly divided
    USE:      4% age 5; 8% all ages 6-17
    CRACPOT:  Ages 5-12 evenly divided

  Race
    SASAS:    White 70%; Black 20%; Other 10%
    USE:      White 37%; Black 54%; Other 9%
    CRACPOT:  White 80%; Black 18%; Other 2%

Other Data

  Scores Reported
    SASAS:    Percentile Ranks, Scaled Scores, Raw Scores
    USE:      Raw Scores, Percentile Ranks, Stanines
    CRACPOT:  Raw Scores, Percentile Ranks, (P) Values

  Cost (per pupil, includes all manuals)
    SASAS:    $1.55 + $.40 for ability test recommended
    USE:      $.75
    CRACPOT:  $1.10

  Time to Administer
    SASAS:    3 hrs. plus 50 mins. for ability test
    USE:      1 hr. 45 mins.
    CRACPOT:  40 mins. plus scoring

  Scoring Services
    SASAS:    $.35 each machine scoring; various services
    USE:      $.10 - $.60 depending on machine scoring chosen
    CRACPOT:  hand scoring only
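
One way to use the norm data above is to ask how closely each test's norm group resembles the district. The sketch below compares the racial composition reported for Harmon (71 percent white, 19 percent black, with the 6 percent Hispanic and 4 percent other combined into a single "Other" category to match the table) with each norm sample. The index used, a plain sum of absolute percentage-point differences, is a crude illustration chosen here and is not a method taken from the test manuals; the function names are likewise hypothetical.

```python
# District figures from the Harmon description; Hispanic (6) and other (4) are combined.
DISTRICT = {"White": 71, "Black": 19, "Other": 10}

# Norm-sample percentages as listed in the comparison of tests.
NORM_GROUPS = {
    "SASAS": {"White": 70, "Black": 20, "Other": 10},
    "USE": {"White": 37, "Black": 54, "Other": 9},
    "CRACPOT": {"White": 80, "Black": 18, "Other": 2},
}


def total_gap(district, norm):
    """Sum of absolute percentage-point differences across categories."""
    return sum(abs(district[group] - norm[group]) for group in district)


def rank_by_match(district, norm_groups):
    """Return (test, gap) pairs ordered from closest match to farthest."""
    gaps = {test: total_gap(district, norm) for test, norm in norm_groups.items()}
    return sorted(gaps.items(), key=lambda item: item[1])


for test, gap in rank_by_match(DISTRICT, NORM_GROUPS):
    print(f"{test}: {gap} percentage points")
```

On this single dimension the SASAS norm group is closest to the district and the USE norm group is farthest. A fuller comparison would treat community size, region, and age the same way before drawing any conclusion.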






Standardized Achievement Series Assessment Survey (SASAS)

Results of the SASAS Assessment Survey, administered in October 1979
to Harmon public school students in grades 4, 6, 8, and 11, are
reported. These scores represent the average performance
in regular programs; those obtained by children in
self-contained classes (gifted/talented and learning disabled)
are not included in school averages.

The SASAS is a norm-referenced ability/achievement
test published by Simpson, Sampson and Belinski Associates. It is
designed to sample students' achievement of the concepts and skills
[illegible]. A description of the content of the subtests follows.

The Ability Estimate measures general educational ability based
upon those factors most closely associated with academic performance,
i.e., measures of verbal, number, and reasoning abilities. It is
designed to assess the student's present academic aptitude.

The Reading Comprehension test measures the ability to understand
[illegible], draw logical conclusions, and retain
significant details. The selections represent several subject
areas: fiction, biography, science, and social studies.

The Reading Vocabulary test measures recognition of synonyms for
short phrases and knowledge of words as they appear in written
context.

The Language Usage test measures knowledge of basic elements
required for correct and effective writing. Included are
capitalization, punctuation, sentence and paragraph structure; use
of modifiers, nouns, verbs, and pronouns; and diction.

The Language Spelling test measures recognition of misspelled
words.

The Mathematics Concepts test measures understanding of basic
numeration and mathematical operations plus knowledge and application
of concepts in measurement, geometry, and problem solving.

The Mathematics Computation test measures ability to handle
computational operations involving addition, subtraction,
multiplication, and division of whole numbers, [illegible]
fractions, decimals, and percents.

The Mathematics test (grade 11) presents exercises involving
practical, realistic situations as well as more abstract exercises
involving number systems and more sophisticated mathematical concepts.

The Social Studies test measures knowledge and appropriate
application of concepts in geography, history, economics, sociology
[illegible].

The Science test measures knowledge and appropriate application of
concepts in biology, matter and energy, earth and space, and
experimentation plus the ability to use written and illustrative
materials.

The Uses of Sources test measures ability to use basic sources of
information, such as tables of contents, indexes, dictionaries,
reference books, library catalogues, maps, charts, and graphs.



Urban School Estimate (USE)



The USE test was investigated by the test committee and a sample of
students tested in the process. We found it to be a competitive
test in many respects for the regular classroom population. It is
a norm-referenced achievement test published by Center Evaluation
and Measurement Systems, Inc. of Metroville, Ohio. It is designed
to sample student achievement throughout grades K-12. It is
intended primarily for the urban market and takes into account some
of the specialized programs found in these districts. Often such
programs are progress-linked as opposed to traditional grade
advancement progress measures that require specific curricular content
at each grade level. The USE test is designed to take advantage
of this kind of curricular system and allows the test user to
reference to national norms as well as establish the level of
achievement of the student. Sensitive measures of the student's ability
and achievement are possible in districts where the curriculum
closely matches the test's content.

Prior to the development of the test items, extensive survey and
examination of major urban districts' curricula was made. While
each district has its own characteristics, the authors believe
that a majority of districts who have been in the move back to
basics will benefit by use of the USE test in their district.
Math and language objectives in these areas are geared to the
recently published series of texts by Bates-Universe Press which
have become the standard for basic education programs.

The Reading test covers vocabulary and comprehension, and provides
a total score for the section. Grade level may be assigned, but
the test recommends that levels in the Bates-Universe series be
utilized for inferences about level. The vocabulary has been
reviewed to eliminate potential bias and ambiguity that many tests
include due to their orientation to standard English without
regard to bilingual students or students who are familiar with
non-standard English.

The Mathematics section also follows the Bates-Universe series
and levels can be established independent of reading scores. The
section on series and relations of numbers fits quite well with
more standard conceptual and computational sections of tests.

Writing skills emphasizes spelling and proofreading, common errors
of students, and relies on the vocabulary and spelling words in
the Bates-Universe series for 80 percent of the words. Distractors
in the proofreading portion of the test are uniformly difficult
and speed is stressed.

Study skills with some map reference, dictionary usage, and
reference material location are tested in this section.



Cr I tor Ion Ability Check of Potential (CRACPOT) 



CRACPOT, a new test published in 1979, is perhaps the test we all 
have been waiting for. Because this test is not referenced to any 
norm group, but rather to criteria, no more can students be 
burdened with the knowledge that they are below average or in the 
failure mode. Simply test, test, and test again until the student 
gets it right. By use of the hand scoring system and the relatively 
short administration time for the test, the teacher can pinpoint 
precisely what the student has learned and what should be studied. 

Critics of district test scores will have to adjust their sights 
for some time to come when you begin reporting the numbers of 
students who are able to successfully pass the test. Progressive 
difficulty levels enable you to test the student for any selected 
criterion level and students can actually expect to pass provided 
the level is suited to their rate of learning. 

Alternate score reporting formats are available to suit purposes 
of the district. The Standard Operating Scores format is a 
scaled score system that will allow districts to report specific 
large numbers to the public and provide teachers with percent 
correct information. Hand scoring encourages the teachers to get 
in touch with their students and saves a considerable amount of 
money over machine scored services. The turnaround time for 
scoring these tests depends upon how dedicated the teachers are 
or upon the amount of leadership pressure exerted by district 
officials, so anything is possible. 

The test was standardized throughout the country and in the opinion 
of the authors, most districts will be able to match their cur- 
riculum objectives to the various test items. Quick, inexpensive, 
and sensitive to the problems of educators today, the CRACPOT 
will get the public off your back once and for all. 



Standardized Achievement Series Assessment Survey (SASAS) 
National Percentile Equivalents for Mean Scores 

                                             READING              LANGUAGE ARTS                MATH
                              Compo- Compre-  Vocab-                  Spell-            Con-  Compu-          Social          Use of
School         Grade Ability    site hension   ulary   Total   Usage     ing   Total   cepts  tation   Total Studies Science Sources

Elementary
[illegible]        4      44      43      48      42      43      50      47      47      52      54      54      45      47      45
                   6      52      64      50      52      49      61      61      59      58      69      71      48      --      --
Chlddix            4      85      87      84      87      86      86      78      85      80      81      81      87      85      87
North              6      85      84      81      85      84      83      80      83      --      --      --      --      --      --
                   4      62      59      58      60      61      69      62      62      62      43      54      63      59      67
Brett             --      --      --      --      --      --      --      --      --      66      49      57      45      42      48
                   4      49      53      58      57      57      58      57      56      52      40      51      56      51      58
Rose               6      49      59      53      49      51      59      61      58      58      --      57      45      54      62
                   4      --      --      --      --      --      --      --      --      40      32      35      36      38      40

Junior High
                   6      37      39      40      31      24      37      42      38      36      49      41      34      31      43
Kennedy            8      58      61      56      53      54      55      54      56      62      63      54      50      49      61

Senior High
Dewey             11      59      65      55      62      62      60      48      53      68      64      66      59      55      54

Student
Chop, H.          --      --      --      --      80      81      83      74      79      85      87      89      79      97      83
Cartwright, C.     4      49      50      58      50      53      52      62      54      43      52      45      45      51      49
Meyers, H.         6      83      80      80      78      81      69      74      70      83      77      83      72      73      70
Turner, J.L.       6      52      54      58      52      50      59      55      57      49      45      48      55      51      62
Southern, B.       8      80      84      84      81      84      78      75      78      87      84      --      82      81      85
Duncan, D.         8      70      69      69      66      73      62      71      62      71      68      71      72      67      71
Marquez           11      55      75      61      61      62      74      61      68      --      89      86      61      --      73
Ktt, A.L.         11      53      43      38      29      34      37      24      28      10      58      52      43      49      45



Standardized Achievement Series Assessment Survey (SASAS) 
National Percentile Equivalents for Mean Scores 
(National average 50th percentile) 

Students, Grade 4: R.F., T.C., J.B., R.N., C.N., B.P., K.K., S.T., E.W., E.G., J.D., A.L., 
J.H., M.G., P.J., E.P., W.S., U.D., M.O., D.P., G.H., M.S., M.T. 

                         READING              LANGUAGE ARTS                MATH
          Compo- Compre-  Vocab-                  Spell-            Con-  Compu-          Social          Use of
 Ability    site hension   ulary   Total   Usage     ing   Total   cepts  tation   Total Studies Science Sources

      44      43      48      42      43      50      47      47      52      54      54      45      47      45
      52      64      50      52      49      61      61      59      58      69      71      48      54      64
      62      59      58      60      61      69      62      62      62      43      54      63      59      67
      85      84      81      85      84      83      80      83      87      72      80      --      --      --
      85      87      84      87      86      86      78      85      80      81      81      87      85      87
      55      62      61      61      61      61      61      59      66      49      57      45      42      48
      49      53      58      57      57      58      57      56      52      49      --      56      51      58
      44      35      37      37      35      44      37      37      40      32      35      36      38      40
      44      35      37      37      35      44      37      37      40      32      35      36      38      40
      37      39      40      31      34      37      42      38      --      --      --      34      31      43
      23      24      21      18      19      28      12      11      13      29      18      26      19      15
      --      61      36      53      54      55      54      56      62      63      64      50      49      61
      44      43      48      42      43      50      47      47      52      54      54      45      47      45
      52      64      50      52      49      61      61      59      58      69      71      48      54      64
      59      65      55      62      62      60      48      53      68      64      66      59      55      54
      67      64      63      65      66      74      67      67      67      48      62      68      50      72
      55      75      61      61      62      74      61      68      81      89      86      61      63      73
      53      43      38      29      34      37      24      28      40      58      52      43      49      45
      83      80      80      78      81      69      74      70      83      77      83      --      --      --
      52      54      58      52      50      59      55      57      49      45      48      55      51      62
      49      50      58      50      53      52      62      54      43      52      45      45      51      49
      70      69      69      66      73      62      71      62      71      68      71      72      67      71
      85      86      78      80      81      83      74      79      85      87      99      79      97      83
      70      69      69      66      73      62      71      62      71      68      71      72      67      71



Stanine Ranges for SASAS and Ability Test 

Percentile Rank     Stanine 

    99-96              9 
    95-89              8 
    88-77              7 
    76-60              6 
    59-40              5 
    39-23              4 
    22-11              3 
    10-4               2 
     3-1               1 
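
The conversion in the stanine table can be sketched in a few lines of Python. This is only an illustration, not part of the SASAS materials: the function name and the sample ranks are assumptions, and the lookup relies solely on the lower bound of each percentile range shown in the table above.

```python
# Lower bound of each percentile-rank range paired with its stanine,
# read from the stanine table above (highest range first).
STANINE_LOWER_BOUNDS = [
    (96, 9), (89, 8), (77, 7), (60, 6), (40, 5),
    (23, 4), (11, 3), (4, 2), (1, 1),
]


def percentile_to_stanine(percentile_rank: int) -> int:
    """Return the stanine for a national percentile rank from 1 to 99.

    Hypothetical helper for illustration only.
    """
    if not 1 <= percentile_rank <= 99:
        raise ValueError("percentile rank must be between 1 and 99")
    # Walk down from the top range; the first lower bound the rank
    # reaches identifies its stanine.
    for lower_bound, stanine in STANINE_LOWER_BOUNDS:
        if percentile_rank >= lower_bound:
            return stanine
    raise AssertionError("unreachable: every rank from 1 to 99 is covered")


# Sample percentile ranks of the kind reported in the score tables.
sample_ranks = [85, 62, 44, 37, 23]
sample_stanines = [percentile_to_stanine(rank) for rank in sample_ranks]
print(sample_stanines)  # [7, 6, 5, 4, 4]
```

Because a stanine groups a band of percentile ranks, two scores that look different as percentiles (37 and 23 here) can fall in the same stanine, which is one reason to be cautious about reading small percentile differences as meaningful.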






